jennahamlin / mashwrapper


ERROR: could not open "*.msh" for reading. #6

Closed Kincekara closed 1 year ago

Kincekara commented 1 year ago

@jennahamlin testGet fails at the MAKE_DATABASE step. I reproduced this error by changing the workdir. testUse runs without a problem. Nextflow version: 21.10.6.5660.

[51/2f34d6] process > NFCORE_MASHWRAPPER:MASHWRAPPER:INPUT_CHECK:SAMPLESHEET_CHECK (inputReads.csv) [100%] 1 of 1 ✔
[0c/9dbeb4] process > NFCORE_MASHWRAPPER:MASHWRAPPER:ORGANISMSHEET_CHECK (inputDB.txt)              [100%] 1 of 1 ✔
[8a/196e16] process > NFCORE_MASHWRAPPER:MASHWRAPPER:DOWNLOAD_GENOMES (4)                           [100%] 6 of 6 ✔
[10/560d3e] process > NFCORE_MASHWRAPPER:MASHWRAPPER:MAKE_MASH (6)                                  [100%] 6 of 6 ✔
[6b/01f105] process > NFCORE_MASHWRAPPER:MASHWRAPPER:MAKE_DATABASE                                  [100%] 1 of 1, failed: 1 ✘
[-        ] process > NFCORE_MASHWRAPPER:MASHWRAPPER:SPECIES_ID                                     -
[-        ] process > NFCORE_MASHWRAPPER:MASHWRAPPER:COMBINED_OUTPUT                                -
[-        ] process > NFCORE_MASHWRAPPER:MASHWRAPPER:CUSTOM_DUMPSOFTWAREVERSIONS                    -
Execution cancelled -- Finishing pending tasks before exit
-[jennahamlin/mashwrapper] Pipeline completed with errors-

                Results will not be emailed. 
                Please check your specified out directory for the results. 
                Your results folder is called: ./results

Error executing process > 'NFCORE_MASHWRAPPER:MASHWRAPPER:MAKE_DATABASE'

Caused by:
  Process `NFCORE_MASHWRAPPER:MASHWRAPPER:MAKE_DATABASE` terminated with an error exit status (1)

Command executed:

  currentDate=`date +"%Y-%m-%d_%T"`

  if ls *noMash.msh &> /dev/null; then   
    rm *noMash.msh; 
    echo 'removing noMash.msh files'; 
    mash sketch *.msh -o myMashDatabase.$currentDate.msh -S 42; 
  else 
    echo 'only .msh files in directory'; 
    mash sketch *.msh -o myMashDatabase.$currentDate.msh -S 42; 
  fi

  cat <<-END_VERSIONS > versions.yml
  "NFCORE_MASHWRAPPER:MASHWRAPPER:MAKE_DATABASE":
      mash: $(mash --version | sed 's/Mash //g')
  END_VERSIONS

Command exit status:
  1

Command output:
  removing noMash.msh files

Command error:
  ERROR: could not open "*.msh" for reading.

Work dir:
  /mnt/dm-3/hpc-scratch/work/6b/01f10588f352f1d8c55fb23d8024df

Tip: when you have fixed the problem you can continue the execution adding the option `-resume` to the run command line
jennahamlin commented 1 year ago

Hi @Kincekara

Could you please supply the exact command you ran on the command line? For example, which profile did you use (singularity, conda, etc.)? I suspect this is a configuration issue with how your compute cluster interacts with the NCBI datasets command-line tools.

My suggestion is to run the testGet profile with conda. This will be a bit slower because of the conda downloads, but if it succeeds, it tells you that the tool works and that the issue lies with the setup on your compute cluster.

Again, the above is mostly speculation, and the exact command would help in solving the issue.

Kincekara commented 1 year ago

Hi @jennahamlin. The conda profile works without a problem. I got the error with the command nextflow run mashwrapper -profile testGet,singularity.

I traced back through the work files. Somehow, datasets cannot download the genome files when it runs under singularity. Here is the .command.log:

WARNING: While bind mounting '/mnt/dm-3/hpc-scratch/work/74/d3ba9ad79bdd3ca5d9047eedf7e5ff:/mnt/dm-3/hpc-scratch/work/74/d3ba9ad79bdd3ca5d9047eedf7e5ff': destination is already in the mount point list
false
Confirming both NCBI datasets and dataformat tools are available...
Great both tools available to access NCBI...

Beginning the process...
Checking your directory...
Good, a downloadedData.tsv summary file does not already exist. Continuing...
Good, the speciesCount.txt summary file doesn't already exist. Continuing...
allDownload directory does not exist, making it now and downloading will begin...
This is one of the species that will be downloaded to make the mash database: legionella jamestowniensis
Beginning to dowload genomes from NCBI...
Assembly level is not specified as the parameter is empty ...
Error: No assembly available
No  files available. Creating a file place holder for this species: legionellajamestowniensis. Exiting.
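
The two logs appear to fit together: when no genome is downloaded, only a placeholder is created, and MAKE_DATABASE then deletes the *noMash.msh files, which can leave nothing for *.msh to match. The shell then passes the literal pattern to mash, hence the 'could not open "*.msh"' message. Below is a minimal sketch of a guard for that case. It is not the pipeline's code: build_database is a hypothetical helper, and the echo stands in for the real mash call.

```shell
# Sketch only: fail early with a clear message when no sketches remain.
# build_database is a hypothetical helper, not part of mashwrapper.
build_database() {
    # Remove placeholder sketches; -f keeps rm quiet if none exist.
    rm -f -- *noMash.msh

    # Expand the glob into the positional parameters. If nothing matches,
    # $1 is the literal string '*.msh', which does not exist as a file.
    set -- *.msh
    if [ ! -e "$1" ]; then
        echo "no .msh files left; check the DOWNLOAD_GENOMES logs" >&2
        return 1
    fi

    # The real pipeline would call mash here with "$@" as its inputs.
    echo "would combine $# sketch file(s): $*"
}
```

Called in a work directory that holds only placeholders, the function returns a non-zero status and points at the download step instead of surfacing an unexpanded glob.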
jennahamlin commented 1 year ago

@Kincekara Yay, glad the conda version works. I ran into the same problem with singularity on the compute cluster I was developing on. I fixed it with a cluster-specific configuration file that sets singularity.runOptions = '-B /etc/pki/ca-trust:/etc/pki/ca-trust'. As far as I can tell, the singularity image of NCBI datasets does not include the certificates, and that is the issue.

I had assumed this configuration requirement would not affect others, but I suspect your case is the same. So let's try that first. You should set up a config file for your compute cluster like the one I made for CDC; see here: https://github.com/jennahamlin/mashwrapper/blob/main/conf/nfcore_custom.config

You will need to specify your cluster executor (e.g., Sun Grid Engine) and whatever your queue is called (e.g., all.q), and then include singularity.runOptions = '-B /etc/pki/ca-trust:/etc/pki/ca-trust' just like in my conf file.
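
For illustration, here is a minimal sketch of a script that writes such a config file. The helper name write_site_config is hypothetical, 'sge' and 'all.q' are placeholders for your own executor and queue, and the linked nfcore_custom.config may contain more settings than shown here. It also assumes your host keeps its CA certificates under /etc/pki/ca-trust.

```shell
# Sketch only: write a site-specific config into a directory that can be
# passed to --custom_config_base. write_site_config is a hypothetical helper.
write_site_config() {
    conf_dir=$1
    mkdir -p "$conf_dir"
    # Quoted 'EOF' keeps the shell from expanding anything in the body.
    cat > "$conf_dir/nfcore_custom.config" <<'EOF'
process {
    // Placeholders: replace with your cluster's executor and queue.
    executor = 'sge'
    queue    = 'all.q'
}

singularity {
    // Bind the host CA certificates into the container.
    runOptions = '-B /etc/pki/ca-trust:/etc/pki/ca-trust'
}
EOF
}
```

For example, write_site_config "$HOME/mashwrapper/conf" would create the file in a conf directory, and that directory (without a trailing /) is what --custom_config_base should point to.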

Once you have done that, the command would be:

nextflow run mashwrapper -profile testGet,singularity --custom_config_base /scicomp/home-pure/ptx4/mashwrapper/conf

where you would change the path given to --custom_config_base to the directory that holds your config file. Lastly, do not end the path with a final /, or the config file will not be found.

Kincekara commented 1 year ago

I updated my singularity config as you directed. It worked like a charm. Thank you!