Issue with excessive symlinking

multimeric commented 1 year ago

Description of the bug

We want to run proteinfold on our cluster, where we already have the AlphaFold data. However, when we pass the --alphafold2_db flag to point to this data, proteinfold tries to symlink each of the thousands of files located in that directory tree. This breaks job submission, as detailed below.

Detailed Explanation

Firstly, I believe that the parameters relating to the AlphaFold database are not explained in enough detail. What files need to be in each path? What directory structure is expected? Should the databases be zipped or unzipped? Knowing this would allow us to diagnose this issue better.
Secondly, this issue manifested itself as Command output: sbatch: error: Batch job submission failed: Pathname of a file, directory or other parameter too long, because the excessive symlinking resulted in a 20 MB sbatch script. In other environments it would manifest differently.
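
To get a feel for why the job script grows so large, the following sketch counts the files under a database directory and estimates how many bytes one symlink command per file would add. The `ln -s` line format and both function names are assumptions for illustration; the text Nextflow actually writes to .command.run may differ.

```shell
#!/usr/bin/env bash
# Rough estimate only: assumes one hypothetical `ln -s <path> <name>` line
# per staged file, which is not necessarily what Nextflow generates.

# count_files DIR: number of regular files and symlinks below DIR.
count_files() {
    find "$1" \( -type f -o -type l \) | wc -l | tr -d ' '
}

# estimate_stage_bytes DIR: total size in bytes of the hypothetical lines.
estimate_stage_bytes() {
    find "$1" \( -type f -o -type l \) | awk '
        {
            n = split($0, parts, "/")
            # "ln -s " (6) + path + " " (1) + basename + newline (1)
            total += 6 + length($0) + 1 + length(parts[n]) + 1
        }
        END { print total + 0 }'
}

# Example, using the database path from this report:
# count_files /vast/projects/alphafold/databases/pdb_mmcif
# estimate_stage_bytes /vast/projects/alphafold/databases/pdb_mmcif
```

Comparing the estimate against the size of the generated .command.run should indicate whether per-file staging is what inflates the script.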

Command used and terminal output

`nextflow run nf-core/proteinfold  --input samples.csv  --outdir ./output --mode alphafold2 --alphafold2_db /vast/projects/alphafold/databases --full_dbs=true --alphafold2_model_preset monomer --use_gpu=true -profile wehi`

Relevant files

The .command.run file that causes the issues: command.zip

System information

Nextflow 22.10.4
HPC
Slurm executor
Singularity engine
CentOS 7 OS
proteinfold 1.0.0

athbaltzis commented 1 year ago

Hi Michael,

You can find below an example of the required directory structure. Thanks for pointing it out. I will add it to the documentation.

If indeed this issue is due to the .command.run size, then I have an idea and will implement it in the next few days.

├── mgnify
│   └── mgy_clusters_2018_12.fa
├── alphafold_params_2022-03-02
│   ├── LICENSE
│   ├── params_model_1_multimer.npz
│   ├── params_model_1_multimer_v2.npz
│   ├── params_model_1.npz
│   ├── params_model_1_ptm.npz
│   ├── params_model_2_multimer.npz
│   ├── params_model_2_multimer_v2.npz
│   ├── params_model_2.npz
│   ├── params_model_2_ptm.npz
│   ├── params_model_3_multimer.npz
│   ├── params_model_3_multimer_v2.npz
│   ├── params_model_3.npz
│   ├── params_model_3_ptm.npz
│   ├── params_model_4_multimer.npz
│   ├── params_model_4_multimer_v2.npz
│   ├── params_model_4.npz
│   ├── params_model_4_ptm.npz
│   ├── params_model_5_multimer.npz
│   ├── params_model_5_multimer_v2.npz
│   ├── params_model_5.npz
│   └── params_model_5_ptm.npz
├── pdb70
│   └── pdb70_from_mmcif_200916
│       ├── md5sum
│       ├── pdb70_a3m.ffdata
│       ├── pdb70_a3m.ffindex
│       ├── pdb70_clu.tsv
│       ├── pdb70_cs219.ffdata
│       ├── pdb70_cs219.ffindex
│       ├── pdb70_hhm.ffdata
│       ├── pdb70_hhm.ffindex
│       └── pdb_filter.dat
├── pdb_mmcif
│   ├── mmcif_files
│   │   ├── 1g6g.cif
│   │   ├── 1go4.cif
│   │   ├── 1isn.cif
│   │   ├── 1kuu.cif
│   │   ├── 1m7s.cif
│   │   ├── 1mwq.cif
│   │   ├── 1ni5.cif
│   │   ├── 1qgd.cif
│   │   ├── 1tp9.cif
│   │   ├── 1wa9.cif
│   │   ├── 1ye5.cif
│   │   ├── 1yhl.cif
│   │   ├── 2bjd.cif
│   │   ├── 2bo9.cif
│   │   ├── 2e7t.cif
│   │   ├── 2fyg.cif
│   │   ├── 2j0q.cif
│   │   ├── 2jcq.cif
│   │   ├── 2m4k.cif
│   │   ├── 2n9o.cif
│   │   ├── 2nsx.cif
│   │   ├── 2w4u.cif
│   │   ├── 2wd6.cif
│   │   ├── 2wh5.cif
│   │   ├── 2wji.cif
│   │   ├── 2yu3.cif
│   │   ├── 3cw2.cif
│   │   ├── 3d45.cif
│   │   ├── 3gnz.cif
│   │   ├── 3j0a.cif
│   │   ├── 3jaj.cif
│   │   ├── 3mzo.cif
│   │   ├── 3nrn.cif
│   │   ├── 3piv.cif
│   │   ├── 3pof.cif
│   │   ├── 3pvd.cif
│   │   ├── 3q45.cif
│   │   ├── 3qh6.cif
│   │   ├── 3rg2.cif
│   │   ├── 3sxe.cif
│   │   ├── 3uai.cif
│   │   ├── 3uid.cif
│   │   ├── 3wae.cif
│   │   ├── 3wt1.cif
│   │   ├── 3wtr.cif
│   │   ├── 3wy2.cif
│   │   ├── 3zud.cif
│   │   ├── 4bix.cif
│   │   ├── 4bzx.cif
│   │   ├── 4c1n.cif
│   │   ├── 4cej.cif
│   │   ├── 4chm.cif
│   │   ├── 4fzo.cif
│   │   ├── 4i1f.cif
│   │   ├── 4ioa.cif
│   │   ├── 4j6o.cif
│   │   ├── 4m9q.cif
│   │   ├── 4mal.cif
│   │   ├── 4nhe.cif
│   │   ├── 4o2w.cif
│   │   ├── 4pzo.cif
│   │   ├── 4qlx.cif
│   │   ├── 4uex.cif
│   │   ├── 4zm4.cif
│   │   ├── 4zv1.cif
│   │   ├── 5aj4.cif
│   │   ├── 5frs.cif
│   │   ├── 5hwo.cif
│   │   ├── 5kbk.cif
│   │   ├── 5odq.cif
│   │   ├── 5u5t.cif
│   │   ├── 5wzq.cif
│   │   ├── 5x9z.cif
│   │   ├── 5xe5.cif
│   │   ├── 5ynv.cif
│   │   ├── 5yud.cif
│   │   ├── 5z5c.cif
│   │   ├── 5zb3.cif
│   │   ├── 5zlg.cif
│   │   ├── 6a6i.cif
│   │   ├── 6az3.cif
│   │   ├── 6ban.cif
│   │   ├── 6g1f.cif
│   │   ├── 6ix4.cif
│   │   ├── 6jwp.cif
│   │   ├── 6ng9.cif
│   │   ├── 6ojj.cif
│   │   ├── 6s0x.cif
│   │   ├── 6sg9.cif
│   │   ├── 6vi4.cif
│   │   └── 7sp5.cif
│   └── obsolete.dat
├── pdb_seqres
│   └── pdb_seqres.txt
├── small_bfd
│   └── bfd-first_non_consensus_sequences.fasta
├── uniclust30
│   └── uniclust30_2018_08
│       ├── uniclust30_2018_08_a3m_db -> uniclust30_2018_08_a3m.ffdata
│       ├── uniclust30_2018_08_a3m_db.index
│       ├── uniclust30_2018_08_a3m.ffdata
│       ├── uniclust30_2018_08_a3m.ffindex
│       ├── uniclust30_2018_08.cs219
│       ├── uniclust30_2018_08_cs219.ffdata
│       ├── uniclust30_2018_08_cs219.ffindex
│       ├── uniclust30_2018_08.cs219.sizes
│       ├── uniclust30_2018_08_hhm_db -> uniclust30_2018_08_hhm.ffdata
│       ├── uniclust30_2018_08_hhm_db.index
│       ├── uniclust30_2018_08_hhm.ffdata
│       ├── uniclust30_2018_08_hhm.ffindex
│       └── uniclust30_2018_08_md5sum
├── uniprot
│   └── uniprot.fasta
└── uniref90
    └── uniref90.fasta
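
The listing above can be turned into a quick sanity check for an existing database directory. This is a minimal sketch: the function name is hypothetical, and the list of entries is copied from the listing rather than from an official specification. The run log later in this thread also shows a bfd path, which is not in the listing and is therefore not checked here.

```shell
#!/usr/bin/env bash
# check_af2_layout DB_DIR: report entries from the listing above that are
# missing under DB_DIR. Returns 0 if everything is present, 1 otherwise.
# The entry list is an assumption based on that listing.
check_af2_layout() {
    af2_db_dir=$1
    af2_missing=0
    for af2_name in mgnify pdb70 pdb_mmcif/mmcif_files pdb_mmcif/obsolete.dat \
                    pdb_seqres small_bfd uniclust30 uniprot uniref90; do
        if [ ! -e "$af2_db_dir/$af2_name" ]; then
            echo "missing: $af2_name"
            af2_missing=1
        fi
    done
    # The parameters directory carries a date suffix, so match it with a glob.
    # If nothing matches, the pattern stays literal and the -e test fails.
    set -- "$af2_db_dir"/alphafold_params_*
    if [ ! -e "$1" ]; then
        echo "missing: alphafold_params_*"
        af2_missing=1
    fi
    return "$af2_missing"
}

# Example:
# check_af2_layout /vast/projects/alphafold/databases
```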

multimeric commented 1 year ago

Yeah so your directory seems to have the same structure, it's just that we have thousands of mmCIF files. I'm surprised that you don't?

$ ls pdb_mmcif/mmcif_files | head
100d.cif
101d.cif
101m.cif
102d.cif
102l.cif
102m.cif
103d.cif
103l.cif
103m.cif
104d.cif
$ ls pdb_mmcif/mmcif_files | wc -l
183793

Considering this, I think finding a solution to the symlinking issue would be ideal.

Also, might it be possible to document the AlphaFold version that is being used for each pipeline release? Because we have several versions of AlphaFold installed with different databases, and we need to know which version should be used with proteinfold.

athbaltzis commented 1 year ago

I do have them too. What I pasted is a reduced version of the databases that I use for testing, so that you can see the structure.

athbaltzis commented 1 year ago

Let me know whether #89 works so that I can close this issue.

athbaltzis commented 1 year ago

I assume that the fix worked, so I am closing the issue. Please feel free to re-open it if it didn't work.

multimeric commented 1 year ago

Hi @athbaltzis, sorry for the late reply.

It doesn't look like this fix worked. I re-ran the pipeline, and it failed with sbatch: error: Batch job submission failed: Pathname of a file, directory or other parameter too long.

I've attached nextflow's output below:

```
INFO: set NXF_WORK=/vast/scratch/users/milton.m/nextflow/work
N E X T F L O W  ~  version 22.10.4
Launching `https://github.com/nf-core/proteinfold` [stoic_bhabha] DSL2 - revision: 22a2ada9c2 [master]

------------------------------------------------------
                                        ,--./,-.
        ___     __   __   __   ___     /,-._.--~'
  |\ | |__  __ /  ` /  \ |__) |__         }  {
  | \| |       \__, \__/ |  \ |___     \`-._,-`-,
                                        `._,._,'
  nf-core/proteinfold v1.0.0-g22a2ada
------------------------------------------------------
Core Nextflow options
  revision                        : master
  runName                         : stoic_bhabha
  containerEngine                 : singularity
  launchDir                       : /vast/scratch/users/milton.m/proteinfold-nf
  workDir                         : /vast/scratch/users/milton.m/nextflow/work
  projectDir                      : /home/users/allstaff/milton.m/.nextflow/assets/nf-core/proteinfold
  userName                        : milton.m
  profile                         : wehi
  configFiles                     : /home/users/allstaff/milton.m/.nextflow/config, /home/users/allstaff/milton.m/.nextflow/assets/nf-core/proteinfold/nextflow.config

Global options
  input                           : samples.csv
  outdir                          : ./output
  use_gpu                         : true

Alphafold2 options
  alphafold2_db                   : /vast/projects/alphafold/databases
  full_dbs                        : true
  alphafold2_model_preset         : monomer

Institutional config options
  config_profile_description      : Walter and Eliza Hall Institute (WEHI) Milton HPC cluster profile
  config_profile_contact          : Jacob Munro (munro.j@wehi.edu.au)
  config_profile_url              : https://www.wehi.edu.au/

Max job request options
  max_cpus                        : 128
  max_memory                      : 1.3 TB
  max_time                        : 2d

Alphafold2 DBs and parameters links options
  bfd_path                        : /vast/projects/alphafold/databases/bfd/*
  small_bfd_path                  : /vast/projects/alphafold/databases/small_bfd/*
  alphafold2_params_path          : /vast/projects/alphafold/databases/alphafold_params_*/*
  mgnify_path                     : /vast/projects/alphafold/databases/mgnify/*
  pdb70_path                      : /vast/projects/alphafold/databases/pdb70/**
  pdb_mmcif_path                  : /vast/projects/alphafold/databases/pdb_mmcif/**
  uniclust30_path                 : /vast/projects/alphafold/databases/uniclust30/**
  uniref90_path                   : /vast/projects/alphafold/databases/uniref90/*
  pdb_seqres_path                 : /vast/projects/alphafold/databases/pdb_seqres/*
  uniprot_path                    : /vast/projects/alphafold/databases/uniprot/*

Colabfold DBs and parameters links options
  colabfold_db_path               : null/colabfold_envdb_202108
  uniref30_path                   : null/uniref30_2202
  colabfold_alphafold2_params_path: null/params/alphafold_params_2021-07-14
  colabfold_alphafold2_params_tags: [AlphaFold2-multimer-v1:alphafold_params_colab_2021-10-27, AlphaFold2-multimer-v2:alphafold_params_colab_2022-03-02, AlphaFold2-ptm:alphafold_params_2021-07-14]

!! Only displaying parameters that differ from the pipeline defaults !!
------------------------------------------------------
If you use nf-core/proteinfold for your analysis please cite:

* The nf-core framework
  https://doi.org/10.1038/s41587-020-0439-x

* Software dependencies
  https://github.com/nf-core/proteinfold/blob/master/CITATIONS.md
------------------------------------------------------
executor >  slurm (1)
[a8/b90c5b] process > NFCORE_PROTEINFOLD:ALPHAFOLD2:INPUT_CHECK:SAMPLESHEET_CHECK (samples.csv) [100%] 1 of 1 ✔
[d3/3762ab] process > NFCORE_PROTEINFOLD:ALPHAFOLD2:RUN_ALPHAFOLD2 (xab_T1) [ 50%] 1 of 2, failed: 1
[-        ] process > NFCORE_PROTEINFOLD:ALPHAFOLD2:CUSTOM_DUMPSOFTWAREVERSIONS -
[-        ] process > NFCORE_PROTEINFOLD:ALPHAFOLD2:MULTIQC -
Execution cancelled -- Finishing pending tasks before exit
-[nf-core/proteinfold] Pipeline completed with errors-
Error executing process > 'NFCORE_PROTEINFOLD:ALPHAFOLD2:RUN_ALPHAFOLD2 (xab_T1)'

Caused by:
  Failed to submit process to grid scheduler for execution

Command executed:

  sbatch .command.run

Command exit status:
  1

Command output:
  sbatch: error: Batch job submission failed: Pathname of a file, directory or other parameter too long

Work dir:
  /vast/scratch/users/milton.m/nextflow/work/d3/3762ab02340f14ff0e45f902f051a0

Tip: view the complete command output by changing to the process work dir and entering the command `cat .command.out`
```

I've also attached the submit script that demonstrates this behaviour: .command.run.txt

JoseEspinosa commented 1 year ago

Hi @multimeric, this should be fixed in the most recent edge version of Nextflow (23.05.0-edge); find here the corresponding issue. You can give it a try by updating Nextflow or by prefixing your command with the version, i.e. NXF_VER='23.05.0-edge' nextflow run ... Let us know if this works for you.
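
As a rough way to decide whether the version pin is needed, the sketch below compares a Nextflow version string against 23.05.0. The helper name is hypothetical, it relies on `sort -V`, and it strips any "-edge" suffix before comparing, which is a simplification. It also assumes that releases after 23.05.0-edge carry the fix, which this thread does not confirm.

```shell
#!/usr/bin/env bash
# version_at_least HAVE WANT: succeed if version HAVE is >= WANT.
# Suffixes such as "-edge" are dropped before comparing (simplification).
version_at_least() {
    have=${1%%-*}
    want=${2%%-*}
    [ "$(printf '%s\n%s\n' "$want" "$have" | sort -V | head -n 1)" = "$want" ]
}

# The version is passed in by hand here; 22.10.4 is the one from this report.
if version_at_least "22.10.4" "23.05.0-edge"; then
    echo "Nextflow is new enough"
else
    echo "Pin a newer release, e.g. NXF_VER='23.05.0-edge' nextflow run ..."
fi
```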

JoseEspinosa commented 1 year ago

I guess the rocket means it worked, so I will close the issue again.
