marklogic / marklogic-contentpump

MarkLogic Contentpump (mlcp)
http://developer.marklogic.com/products/mlcp
Apache License 2.0

support for -custom-uri on import #443

Open starhound opened 1 month ago

starhound commented 1 month ago

Please add support for specifying a custom URI when importing compressed manuals, or provide a proper way to override the default URI naming behavior.

I have followed: https://docs.marklogic.com/guide/mlcp-guide/en/importing-content-into-marklogic-server/controlling-database-uris-during-ingestion/transforming-the-default-uri.html

Various configurations do not work. Perhaps I have the syntax wrong; it is not obvious from the documentation which regular expression flavor is used (Perl, PCRE?).

I need to import compressed manuals (all .zip for now, possibly gzip later) and make minor alterations to the URI, because it includes the .zip extension by default.

java.lang.IllegalArgumentException: Invalid option argument for output_uri_replace :Boeing 777 Test Manual.zip,TESTPATH43/Boeing_777_Test_Manual

My file is named Boeing 777 Test Manual.zip, and its contents need to end up under /USER_INPUT_ROOT/MANUAL_NAME/<files>.

I have a Python API that acts as a wrapper for MLCP, and it works without issue except for this behavior.

def import_data(database, root_path, files, marklogic_connection):
    for file in files:
        # convert spaces in file.filename to underscores
        file_name = file.filename.replace(" ", "_")
        # remove the file extension from the end of the string (split on the last dot only)
        file_name = file_name.rsplit(".", 1)[0]
        # if root_path has a trailing slash, do nothing, else add a trailing slash
        root_path = root_path if root_path.endswith("/") else f"{root_path}/"
        # if root_path has starting slash, remove it
        root_path = root_path[1:] if root_path.startswith("/") else root_path
        file_uri = f"{root_path}{file_name}"
        print(f"Importing {file.filename} to {file_uri}")
        cmd = [
            MLCP,
            "import",
            f"-host {marklogic_connection['host']}",
            f"-port {marklogic_connection['port']}",
            f"-database {database}",
            f"-username {marklogic_connection['username']}",
            f"-password {marklogic_connection['password']}",
            "-input_compressed true",
            "-mode local",
            "-base_path /",
            "-input_compression_codec zip",
            "-ssl false",
            f"-input_file_path '/tmp/{file.filename}'",
            f"-output_uri_replace '{file.filename},{file_uri}'"
        ]
        invoke_mlcp(cmd)
        # remove the file from the /tmp directory
        subprocess.run(["rm", f"/tmp/{file.filename}"])

invoke_mlcp() simply runs the provided bash script in a subprocess.

Thank you

starhound commented 1 month ago

A temporary solution for us, and maybe for other users, could be to use the Python client library to update our URIs after uploading.

starhound commented 1 month ago

My apologies, you do clearly state the syntax here, but it is difficult to find: https://docs.marklogic.com/guide/mlcp-guide/en/introduction-to-marklogic-content-pump/understanding-the-mlcp-command-line/regular-expression-syntax.html

I will continue trying to get my desired functionality.

starhound commented 1 month ago

The solution to my issue was described here: https://stackoverflow.com/a/72952214

The -output_uri_replace argument needs to be enclosed in double quotes.
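
For anyone hitting the same error from a Python wrapper, here is a minimal sketch of building the option so the quoting survives. It assumes, based on my reading of the mlcp guide, that -output_uri_replace takes comma-separated pattern,'replacement' pairs with the replacement string in single quotes. The helper names build_uri_replace and build_import_args and the mlcp.sh path are hypothetical, and the connection flags are left out for brevity.

```python
import re

# Characters treated as regex metacharacters; escaped so the file name matches literally.
_REGEX_META = re.compile(r"([\\.^$|?*+()\[\]{}])")


def build_uri_replace(filename, target):
    """Return a pattern,'replacement' pair for -output_uri_replace.

    Assumption: mlcp expects the replacement string wrapped in single quotes.
    """
    if "," in filename or "'" in target:
        raise ValueError("commas in the pattern or quotes in the replacement are not handled here")
    pattern = _REGEX_META.sub(r"\\\1", filename)
    return f"{pattern},'{target}'"


def build_import_args(mlcp, zip_path, filename, target):
    """Build an argv list with each flag and each value as a separate element."""
    return [
        mlcp, "import",
        "-input_file_path", zip_path,
        "-input_compressed", "true",
        "-input_compression_codec", "zip",
        "-output_uri_replace", build_uri_replace(filename, target),
    ]


args = build_import_args(
    "mlcp.sh",  # hypothetical path to the mlcp launcher script
    "/tmp/Boeing 777 Test Manual.zip",
    "Boeing 777 Test Manual.zip",
    "TESTPATH43/Boeing_777_Test_Manual",
)
print(args[-1])
```

Passing a list like this to subprocess.run without shell=True means no shell is involved to strip the inner single quotes. If the command is assembled as a single shell string instead, the whole option value would need to be wrapped in double quotes so the single quotes around the replacement are preserved.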

My request is now to expand the MLCP documentation on this functionality, as I wasted a day of development time on this.

abika5 commented 1 month ago

Hi @starhound, thanks for filing the issue. I will take a look and triage it for documentation.

yunzvanessa commented 3 weeks ago

@abika5 Please create a ticket in Jira and address it in the next sprint.

Thanks, Vanessa