mimno / Mallet

MALLET is a Java-based package for statistical natural language processing, document classification, clustering, topic modeling, information extraction, and other machine learning applications to text.
https://mimno.github.io/Mallet/

Can't run mallet via python due to FileNotFoundError #207

Closed: afristo closed this issue 1 year ago

afristo commented 1 year ago

I'm using Python 3.7 in a Docker container and trying to run Mallet within the container. We are using a Python wrapper to send the commands to Mallet (shown below):

    template = "/Mallet-202108/bin/mallet import-file \
                --input /input_data/mallet_corpus.txt \
                --output /input_data/corpus.mallet \
                --keep-sequence \
                --remove-stopwords {remove_stopwords} \
                --line-regex '^(\S*)[\s,]*(.*)$' \
                --name 1 --label 0 --data 2"
    mallet_call = template.format(remove_stopwords = config['lda_params']['remove_stopwords'])

    remove_stopwords = config['lda_params']['remove_stopwords']
    template = f"/Mallet-202108/bin/mallet import-file --input /input_data/mallet_corpus.txt --output /input_data/corpus.mallet --keep-sequence --remove-stopwords {remove_stopwords} --line-regex '^(\S*)[\s,]*(.*)$' --name 1 --label 0 --data 2"

    logging.info(mallet_call)
    mallet_call = shlex.split(mallet_call)
    subprocess.run(mallet_call, shell=True)

I am getting the following error when I try to run this call:

    import-file: 1: /Mallet-202108/bin/mallet: not found

I checked the file permissions for the "mallet" binary using ls -lha, and I have read, write, and execute permissions.

The file path is also correct (verified with pwd and ls). I've tried a different Docker image, both the relative and the absolute file path, and several different ways of writing the command string, all with no luck. Is there any other reason I would be unable to execute the mallet binary?

mimno commented 1 year ago

You might try copying the command line from the template string directly into a terminal? This would help determine whether it's a path issue or a Python issue. To confirm: the paths start with "/", so are Mallet-202108 and input_data both directories at the root of the filesystem, not in the current working directory?

afristo commented 1 year ago

> You might try copying the command line from the template string directly into a terminal? This would help determine whether it's a path issue or a Python issue. To confirm: the paths start with "/", so are Mallet-202108 and input_data both directories at the root of the filesystem, not in the current working directory?

Yep! They are both located at the root of the filesystem, as is the Python script running the commands.

I did log the template string before sending it to subprocess.run() and it is correct.
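For anyone debugging a similar "not found" on a file that clearly exists: one thing worth ruling out is the launcher's interpreter line. On Linux, a script whose shebang points at a missing interpreter (or whose first line ends in a Windows-style CRLF) can be reported as "not found" even though the script itself is present and executable. Whether that applies in this thread is not established. The helper below is a minimal sketch; `diagnose_executable` is a hypothetical name, not part of Mallet or the Python standard library.

```python
import os


def diagnose_executable(path):
    """Collect basic facts about a launcher file and return them as a dict."""
    info = {
        "exists": os.path.isfile(path),
        "executable": os.access(path, os.X_OK),
        "interpreter": None,         # first word after "#!", if any
        "interpreter_exists": None,  # does that interpreter path exist?
        "crlf_shebang": None,        # does the shebang line end in CRLF?
    }
    if not info["exists"]:
        return info

    # Read only the first line, in binary mode, so a stray "\r" stays visible.
    with open(path, "rb") as handle:
        first_line = handle.readline()

    if first_line.startswith(b"#!"):
        info["crlf_shebang"] = first_line.endswith(b"\r\n")
        parts = first_line[2:].strip().split()
        if parts:
            interpreter = parts[0].decode("utf-8", "replace")
            info["interpreter"] = interpreter
            info["interpreter_exists"] = os.path.exists(interpreter)
    return info
```

Calling `diagnose_executable("/Mallet-202108/bin/mallet")` inside the container and logging the result would show at a glance whether the file, its execute bit, and its interpreter are all in place.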

mimno commented 1 year ago

My guess is that something is going wrong with the subprocess call. I'm not really sure what the error means, or whether it's interpreting 1 or import-file as a file or a command. If it's not a Java issue, I can't really help. If you figure it out and have a good example, please post it!
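On the question of how the pieces are being interpreted: on POSIX, when subprocess.run receives a list together with shell=True, only the first item is treated as the command string, and the remaining items become arguments to the shell itself ($0, $1, ...). That would be consistent with the "import-file: 1:" prefix in the error, since shells such as dash prefix their messages with $0 and a line number, though the shell in the original container is an assumption here. A small sketch that shows the behavior, using echo as a stand-in for the real command:

```python
import subprocess

# With shell=True and a list, POSIX Python effectively runs:
#   /bin/sh -c "<first item>" <second item> <third item> ...
# so "import-file" becomes $0 and "--input" becomes $1 of the shell,
# instead of being passed to the command as its arguments.
result = subprocess.run(
    ["echo name=$0 first=$1", "import-file", "--input"],
    shell=True,
    capture_output=True,
    text=True,
)
print(result.stdout.strip())  # name=import-file first=--input
```

So with shlex.split followed by shell=True, the shell is asked to run the first list item with no arguments at all, and everything after it is silently handed to the shell rather than to the program.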

afristo commented 1 year ago

I fixed it by rebuilding the Docker container and removing the shell=True argument. I don't think the issue was the subprocess command itself, but rather something with the underlying binaries.

I pulled the repository down, recompiled everything, and it ran flawlessly.
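For reference, here is a minimal sketch of what the call can look like without shell=True. The paths and flags are copied from the command earlier in the thread; `build_import_command` and `run_import` are hypothetical helper names, and the value passed for --remove-stopwords is assumed to come from the caller's config as in the original snippet.

```python
import subprocess

MALLET_BIN = "/Mallet-202108/bin/mallet"  # path taken from the thread


def build_import_command(remove_stopwords):
    """Return the import-file call as an argument list (no shell quoting needed)."""
    return [
        MALLET_BIN, "import-file",
        "--input", "/input_data/mallet_corpus.txt",
        "--output", "/input_data/corpus.mallet",
        "--keep-sequence",
        "--remove-stopwords", str(remove_stopwords),
        # Raw string keeps the regex backslashes intact; no extra quotes are
        # needed because no shell is involved in parsing the arguments.
        "--line-regex", r"^(\S*)[\s,]*(.*)$",
        "--name", "1", "--label", "0", "--data", "2",
    ]


def run_import(remove_stopwords):
    # shell defaults to False, so each list item reaches the program as-is;
    # check=True raises CalledProcessError on a non-zero exit status.
    return subprocess.run(build_import_command(remove_stopwords), check=True)
```

Building the list directly also removes the need for shlex.split, which leaves one less place for quoting to go wrong.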