Process output step still failing

kf-cuanschutz commented 3 months ago

Hi, I am unable to run the process output step; please see my output below. Are you familiar with this kind of error?

Loading JSON file specifying where colabfold results are located: cdc11_output.json
Loading data with 4 workers
Processing results...
gene_name: CDC11
condition: 30aa_monomer_CDC11
multiprocessing.pool.RemoteTraceback:
"""
Traceback (most recent call last):
  File "/projects/kefo9343/software/anaconda/envs/fragfold/lib/python3.7/multiprocessing/pool.py", line 121, in worker
    result = (True, func(*args, **kwds))
  File "/projects/kefo9343/software/anaconda/envs/fragfold/lib/python3.7/multiprocessing/pool.py", line 44, in mapstar
    return list(map(*args))
  File "/projects/kefo9343/FragFold/FragFold/fragfold/colabfold_process_output.py", line 28, in load_confidence_data
    start,end = int(start),int(end)
ValueError: invalid literal for int() with base 10: 'colabfold'
"""

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/projects/kefo9343/FragFold/FragFold/fragfold/colabfold_process_output.py", line 208, in <module>
    main(args)
  File "/projects/kefo9343/FragFold/FragFold/fragfold/colabfold_process_output.py", line 159, in main
    confidence_df = get_confidence_dataframe(path,n_workers)
  File "/projects/kefo9343/FragFold/FragFold/fragfold/colabfold_process_output.py", line 72, in get_confidence_dataframe
    data = pool.map(func=load_confidence_data,iterable=all_paths,chunksize=1)
  File "/projects/kefo9343/software/anaconda/envs/fragfold/lib/python3.7/multiprocessing/pool.py", line 268, in map
    return self._map_async(func, iterable, mapstar, chunksize).get()
  File "/projects/kefo9343/software/anaconda/envs/fragfold/lib/python3.7/multiprocessing/pool.py", line 657, in get
    raise self._value
ValueError: invalid literal for int() with base 10: 'colabfold'
ERROR conda.cli.main_run:execute(49): `conda run python /projects/kefo9343/FragFold/FragFold/fragfold/colabfold_process_output.py --import_json cdc11_output.json --experimental_data /projects/kefo9343/FragFold/FragFold/input/data/Savinov_2022_inhib_peptide_mapping.csv` failed. (See above for error)
Elapsed: 0hrs 0min 4sec
[kefo9343@c3cpu-a2-u32-1 FragFold]$ vim /projects/kefo9343/FragFold/FragFold/input/data/Savinov_2022_inhib_peptide_mapping.csv

swanss commented 3 months ago

Hi! Sorry you're still running into issues. In this case it looks like something is broken in the filename parsing.
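For context, the `ValueError` in the traceback is what `int()` raises when it is handed a non-numeric token such as `'colabfold'`. Below is a minimal sketch of a more defensive way to pull a residue range out of a file name. The naming scheme, the regular expression, and the function name `parse_fragment_range` are assumptions made for illustration only; they are not FragFold's actual parsing code.

```python
import re
from pathlib import Path

# Hypothetical naming scheme, assumed only for this sketch:
#   <prefix>_<start>-<end>_<anything>
FRAGMENT_RANGE = re.compile(r"_(\d+)-(\d+)_")


def parse_fragment_range(path):
    """Return (start, end) parsed from a file name, or raise a descriptive error."""
    name = Path(path).name
    match = FRAGMENT_RANGE.search(name)
    if match is None:
        # Name the offending file instead of failing later inside int().
        raise ValueError(
            f"cannot find a '<start>-<end>' residue range in file name: {name!r}"
        )
    return int(match.group(1)), int(match.group(2))
```

With a check like this, a stray file in the results directory would be reported by name rather than surfacing as a bare `int()` failure inside a worker process.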

Could you please send me:

The JSON file cdc11_output.json
A zipfile of the output directory. To keep the file from getting too large, you can exclude the pdb/a3m files with something like this: `zip -rq 230526_colabfold115_ftsZ.zip 230526_colabfold115_ftsZ -x "*.pdb" -x "*.a3m"`

Once I have these I can figure out what's causing the issue :)
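If the `zip` command-line tool is not available on the cluster, the same kind of archive can be built with Python's standard library. This is only a sketch under that assumption: the function name `zip_without` is hypothetical, and the default patterns simply mirror the `-x "*.pdb" -x "*.a3m"` exclusions mentioned above.

```python
import fnmatch
import zipfile
from pathlib import Path


def zip_without(src_dir, zip_path, exclude=("*.pdb", "*.a3m")):
    """Archive src_dir into zip_path, skipping files whose names match an exclude pattern."""
    src = Path(src_dir)
    zip_path = Path(zip_path)
    kept = []
    with zipfile.ZipFile(zip_path, "w", compression=zipfile.ZIP_DEFLATED) as zf:
        for path in sorted(src.rglob("*")):
            if not path.is_file():
                continue
            if path.resolve() == zip_path.resolve():
                continue  # never add the archive to itself
            if any(fnmatch.fnmatch(path.name, pattern) for pattern in exclude):
                continue
            # Store paths relative to the parent so the top-level directory name is kept.
            arcname = path.relative_to(src.parent).as_posix()
            zf.write(path, arcname)
            kept.append(arcname)
    return kept
```

The returned list of archived names makes it easy to confirm that the large structure and alignment files were actually left out.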

swanss commented 3 months ago

On second thought, it would be better to get your whole colabfold output directory so I can check that it runs all the way through and gives you the proper csv output. Could you please send your email? I will add you to a dropbox where you can upload a zip file containing all of the output (the PDBs are necessary for the contact counting step).

kf-cuanschutz commented 3 months ago

Thank you so much! My email address is kevin.fotso@cuanschutz.edu

kf-cuanschutz commented 3 months ago

Hi @swanss, did you send me the dropbox invite? Let me know if you need any information.

swanss commented 3 months ago

Just shared it, let me know if you have any trouble uploading

https://www.dropbox.com/scl/fo/ngzjn3yfx6uieang77d90/AJa-5xK5MYpn34EQJ9den54?rlkey=1iittzbn9i2rt7pkgv6n123xm&dl=0

kf-cuanschutz commented 3 months ago

Sorry for the delay, I am still transferring the data between different filesystems. I will let you know once it completes and I can upload the dataset. Thank you so much!

swanss commented 2 months ago

Just checking, is it still uploading? I see the repo, CDC11 a3m files, and a bunch of out/err files, but I don't see the output from the colabfold jobs.

kf-cuanschutz commented 2 months ago

Hi @swanss, everything should be uploaded now. Let me know if you have any questions.

swanss commented 2 months ago

Hi,

I'm a little confused by your output. In particular, I only see the log files from the colabfold jobs (which indicate that they ran successfully), but none of the actual output (e.g. PDB files).

Where is the script that you ran to submit the colabfold jobs? The idea is that you're supposed to copy ./fragfold/submit_jobs/submit_fragfold.sh to a new directory, set the variables that are specific to your job, and then run it to submit the jobs.
Did you try running the example to make sure everything works? It looks like you may have run the first step and generated .a3m files, but I don't see anything else.

It looks like you're not the only one who has run into issues, so I will consider adding details to the README to help explain how to go through all the steps. As of right now, would you mind running the full example, up to the process output step, just to verify that it works?
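A quick way to check whether a directory holds any structure output, as opposed to only log files, is to count the files under it by extension. The sketch below uses only the standard library; the function name `count_by_extension` and the path in the usage note are placeholders, not part of FragFold.

```python
from collections import Counter
from pathlib import Path


def count_by_extension(root):
    """Count files under root (recursively), keyed by lower-cased file extension."""
    counts = Counter()
    for path in Path(root).rglob("*"):
        if path.is_file():
            counts[path.suffix.lower()] += 1
    return counts
```

For example, `count_by_extension("path/to/output")[".pdb"]` being zero would indicate that no PDB files were written under that directory, since a `Counter` returns 0 for missing keys.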

kf-cuanschutz commented 2 months ago

Hi @swanss thank you very much for your help.

To be very clear, the script that I submitted was run_colabfold_process_output.slurm. Prior to submitting that script, I ran the script bash_fragfold.sh, which automatically called the script submit_colabfold_slurmarray.sh. As a result, it produced the text files colabfold_5542203_. All of those scripts were submitted from the upper level directory "Fragfold".
Prior to this, yes, I ran the example as well and everything was working. The only steps I did not run were "Process output" and "Downstream analysis". I can try to run the process output step now as well.

Thank you for your help and let me know if I was not clear.

kf-cuanschutz commented 2 months ago

Hi,

When running process_output on the example I get the following error. Do you know what that means?

Loading data with 4 workers
Traceback (most recent call last):
  File "/projects/kefo9343/FragFold/FragFold/fragfold/colabfold_process_output.py", line 208, in <module>
    main(args)
  File "/projects/kefo9343/FragFold/FragFold/fragfold/colabfold_process_output.py", line 145, in main
    colab_results = json.loads(file.read())
  File "/projects/kefo9343/software/anaconda/envs/fragfold/lib/python3.7/json/__init__.py", line 348, in loads
    return _default_decoder.decode(s)
  File "/projects/kefo9343/software/anaconda/envs/fragfold/lib/python3.7/json/decoder.py", line 337, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
  File "/projects/kefo9343/software/anaconda/envs/fragfold/lib/python3.7/json/decoder.py", line 355, in raw_decode
    raise JSONDecodeError("Expecting value", s, err.value) from None
json.decoder.JSONDecodeError: Expecting value: line 2 column 33 (char 34)
ERROR conda.cli.main_run:execute(49): `conda run python /projects/kefo9343/FragFold/FragFold/fragfold/colabfold_process_output.py --import_json colabfold_output_ftsz.json --experimental_data /projects/kefo9343/FragFold/FragFold/input/data/Savinov_2022_inhib_peptide_mapping.csv` failed. (See above for error)
Elapsed: 0hrs 0min 8sec

swanss commented 2 months ago

Hi @kf-cuanschutz,

Sorry for the delay, I've been busy with other commitments. Based on the error message, something is wrong with the json file at line 2 column 33. Can you please share the file here (or on the dropbox) so I can give it a look?
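To see exactly what the parser is objecting to, the line and column reported by `json.JSONDecodeError` can be turned into a pointer at the offending character. This is a small standard-library sketch; the helper name `describe_json_error` is made up for illustration.

```python
import json


def describe_json_error(text):
    """Return None if text is valid JSON, else a message pointing at the failing position."""
    try:
        json.loads(text)
    except json.JSONDecodeError as err:
        lines = text.splitlines()
        # lineno and colno are 1-based attributes of JSONDecodeError.
        bad_line = lines[err.lineno - 1] if err.lineno <= len(lines) else ""
        pointer = " " * (err.colno - 1) + "^"
        return f"{err.msg} at line {err.lineno}, column {err.colno}\n{bad_line}\n{pointer}"
    return None
```

Calling it on the contents of the file, for example `print(describe_json_error(open("colabfold_output_ftsz.json").read()))`, prints the failing line with a caret under the character where parsing stopped.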

More generally, I recognize that it's tricky to run the pipeline since there are multiple steps and it's not always clear which one is failing. I'm working on a new branch that will use nextflow as a workflow manager, which should be a lot more robust than the current bash scripts. Let me know if you're interested in testing the new branch once it's ready. It will require creating the fragfold environment fresh, with nextflow installed, but otherwise shouldn't be too difficult.

kf-cuanschutz commented 2 months ago

Hi @swanss, thank you very much for the update! Yes, I have attached the .json file to this message, and yes, I would be very happy to test the new branch!

colabfold_output_ftsz.json

swanss commented 2 months ago

Oh, sorry the example was unclear: comments are not supported in JSON files (the comment in the example was just there to explain it). Try this file instead. It will work as long as the colabfold jobs ran successfully and their output is in the directory you provided.

colabfold_output_ftsz_ss.json
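The point about comments is easy to confirm with Python's `json` module, which rejects them outright. In the sketch below the key name and path are placeholders and do not reflect the schema FragFold expects; `is_valid_json` is a hypothetical helper.

```python
import json

# Placeholder content: the key and path are illustrative, not FragFold's schema.
with_comment = '{\n  "results_dir": "path/to/output"  // where the colabfold output lives\n}'
without_comment = '{\n  "results_dir": "path/to/output"\n}'


def is_valid_json(text):
    """Return True if text parses as JSON, False if the parser rejects it."""
    try:
        json.loads(text)
    except json.JSONDecodeError:
        return False
    return True
```

Here `is_valid_json(with_comment)` is false while `is_valid_json(without_comment)` is true, so any explanatory comments copied from an example have to be deleted before the file is used.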

kf-cuanschutz commented 2 months ago

Thank you! Let me try that.

kf-cuanschutz commented 2 months ago

I just got this error.

Loading data with 4 workers
Processing results...
gene_name: ftsZ-coding-EcoliBL21DE3
condition: 30aa_monomer_ftsZ
multiprocessing.pool.RemoteTraceback: 

Traceback (most recent call last):
  File "/projects/kefo9343/software/anaconda/envs/fragfold/lib/python3.7/multiprocessing/pool.py", line 121, in worker
    result = (True, func(*args, **kwds))
  File "/projects/kefo9343/software/anaconda/envs/fragfold/lib/python3.7/multiprocessing/pool.py", line 44, in mapstar
    return list(map(*args))
  File "/projects/kefo9343/FragFold/FragFold/fragfold/colabfold_process_output.py", line 28, in load_confidence_data
    start,end = int(start),int(end)
ValueError: invalid literal for int() with base 10: 'colabfold'

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/projects/kefo9343/FragFold/FragFold/fragfold/colabfold_process_output.py", line 208, in <module>
    main(args)
  File "/projects/kefo9343/FragFold/FragFold/fragfold/colabfold_process_output.py", line 159, in main
    confidence_df = get_confidence_dataframe(path,n_workers)
  File "/projects/kefo9343/FragFold/FragFold/fragfold/colabfold_process_output.py", line 72, in get_confidence_dataframe
    data = pool.map(func=load_confidence_data,iterable=all_paths,chunksize=1)
  File "/projects/kefo9343/software/anaconda/envs/fragfold/lib/python3.7/multiprocessing/pool.py", line 268, in map
    return self._map_async(func, iterable, mapstar, chunksize).get()
  File "/projects/kefo9343/software/anaconda/envs/fragfold/lib/python3.7/multiprocessing/pool.py", line 657, in get
    raise self._value
ValueError: invalid literal for int() with base 10: 'colabfold'
ERROR conda.cli.main_run:execute(49): `conda run python /projects/kefo9343/FragFold/FragFold/fragfold/colabfold_process_output.py --import_json colabfold_output_ftsz_ss.json --experimental_data /projects/kefo9343/FragFold/FragFold/input/data/Savinov_2022_inhib_peptide_mapping.csv` failed. (See above for error)
Elapsed: 0hrs 0min 4sec

swanss commented 2 months ago

Hi,

I couldn't find any ColabFold output in the dropbox directory you uploaded from your job, and you hit the same error when running the example. From that I've concluded that the problem is in the slurm array mode of the original bash script for submitting jobs. I included that mode because I know other SLURM clusters use it, but it's disabled on my HPC system, which makes it difficult to debug. I also noticed that the settings in pyproject.toml were preventing the src directory from being included in the package, which could also be a source of issues.
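For readers unfamiliar with array mode: SLURM exports `SLURM_ARRAY_TASK_ID` to each task of an array job, and each task typically uses it to pick its own input. The sketch below shows that general pattern in Python with explicit failure messages. It is not how FragFold's submission script is written; the function name `select_array_task` and the 0-based indexing are assumptions made for illustration.

```python
import os


def select_array_task(items, env=None):
    """Pick the item for this array task from SLURM_ARRAY_TASK_ID (assumed 0-based here)."""
    env = os.environ if env is None else env
    raw = env.get("SLURM_ARRAY_TASK_ID")
    if raw is None:
        raise RuntimeError(
            "SLURM_ARRAY_TASK_ID is not set; was this launched as an array job?"
        )
    index = int(raw)
    if not 0 <= index < len(items):
        raise IndexError(
            f"array task {index} has no matching input (valid: 0..{len(items) - 1})"
        )
    return items[index]
```

Failing loudly when the variable is missing or out of range makes it clearer when a job was not actually launched as an array task, instead of the job finishing without producing output.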

I've implemented a nextflow workflow for running FragFold, which should make it easier to run and to track down the source of errors. I haven't merged it into the main branch yet, but I have pushed it here. If you're interested, please go ahead and follow the README to run the example; I tried to make it as detailed as possible. Make sure to pull all the changes, switch to the nextflow branch, and reinstall the conda env/fragfold so that it reflects the newest version.

swanss / FragFold

Process output step still failing #4