failed run with Zymo control data

splaisan commented 1 year ago

Hi Vladimir,

This is for when you return from vacation, no hurry!

Our first run completed today, and I tried to analyse the data as you told me, providing the FASTQ folder as input and adding --demultiplexed true.

The sequences BLAST nicely and return Saccharomyces many times, which is what we expect from the Zymo D6305 sample. However, the Nextflow run failed, apparently because the reference data was not found (FATAL: container creation failed: mount /mnt/Dat2/DB/UNITE/Leho_Subset->/mnt/Dat2/DB/UNITE/Leho_Subset error: while mounting /mnt/Dat2/DB/UNITE/Leho_Subset: mount source /mnt/Dat2/DB/UNITE/Leho_Subset doesn't exist)

I attach an archive with the .nextflow log and my input data as well as my full command, I hope you will spot the issue and let me know (if you do not get the zip, just let me know).

Also, the amplicons from feces samples run in parallel (for which we have nice 16S) return much larger ITS-amplicon sequences (2-3 kb), which so far all BLAST to bacterial entities with various levels of identity.

Can you share the expected on-target frequency for this PCR with mixed populations that include bacterial content? It seems very low in our case: I have not yet found a single non-bacterial hit (BLAST is still running...)

Thanks for your tips and info

failed_run.zip

vmikk commented 1 year ago

Hello Stéphane,

Thank you for bringing this to my attention and for supplying the test data. It seems I made a rather simple error in the default configuration file; it currently directs to a reference database that is saved locally on my machine. To resolve this, please download the reference database and specify the path to the file on your system using --chimera_db (see the example command for guidance). I will fix the configuration file soon. Looking ahead, I plan to streamline this process further by automatically fetching the reference database in upcoming versions if it is not specified by the user.

I conducted a test run using the main branch of NextITS (commit 1d649ac18c):

## Download the reference database (~1.3GB)
curl -J -O "https://owncloud.ut.ee/owncloud/s/iaQ3i862pjwYgdy/download/UN95_chimera.udb"

## Pull the Singularity image (could be skipped and Nextflow will do it automatically)
mkdir -p Singularity_images
singularity pull library://vmiks/nextits/nextits:0-0-5
mv nextits_0-0-5.sif Singularity_images/vmiks-nextits-nextits-0-0-5.img

## Modify config (set the number of CPUs for ITSx)
cat > conf.config <<'EOT'
process {
  withName:itsx{
      cpus = 8
  }
}
EOT

## Move input data to input dir
mkdir -p Input/Zymo
cp bc1003--bc1061.fastq.gz Input/Zymo/

## Params
forp="TACACACCGCCCGTCG"
revp="CCTSCSCTTANTDATATGC"

## Point Nextflow to use the downloaded image
export NXF_SINGULARITY_CACHEDIR="$(pwd)/Singularity_images/"

## Step-1 [QC and ITS extraction]
nextflow run vmikk/NextITS -r main \
  -resume -profile singularity \
  -c conf.config \
  --input "$(pwd)/Input/Zymo/" \
  --demultiplexed true \
  --primer_forward "${forp}" \
  --primer_reverse "${revp}" \
  --chimera_db "$(pwd)/UN95_chimera.udb" \
  --outdir  "Step1_Results/Zymo" \
  -work-dir "Step1_wd"

## Step-2 [Clustering]
nextflow run vmikk/NextITS -r main \
  -main-script Step2_AggregateRuns.nf \
  -resume -profile singularity \
  --data_path "$(pwd)/Step1_Results" \
  --outdir     "Step2_Results" \
  -work-dir    "Step2_wd" \
  --clustering_method "vsearch"
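
Since the original failure came from a database path that did not exist on the user's machine, a small pre-flight check before launching Step-1 may save a failed run. The sketch below is only an illustration: `check_db` is a hypothetical helper, not part of NextITS, and it only tests that the file is present and non-empty, not that it is a valid database.

```shell
# check_db: hypothetical helper (not part of NextITS).
# Succeeds only if the given path is a non-empty regular file.
check_db() {
  if [ -f "$1" ] && [ -s "$1" ]; then
    echo "OK: $1"
  else
    echo "Missing or empty: $1" >&2
    return 1
  fi
}

# Example: check the file downloaded with curl above
# (prints a message on stderr if it is not there yet)
check_db "$(pwd)/UN95_chimera.udb" || true
```

In practice the check could be chained in front of the Step-1 command with `&&`, so that Nextflow is only started when the file passed to --chimera_db is in place.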

Regarding the processing time, you can expect Step-1 to take around 9 minutes with the current setup. The ITSx process is currently the primary bottleneck. To speed up processing, you may limit ITSx to the fungal profile (using the --ITSx_tax Fungi option), or allocate more CPUs to ITSx (by default, it will use 2; but for this run I've set it to 8 CPUs).

With the default settings, I've got six sequences in the output table (Step2_Results/05.LULU/OTU_table_LULU.txt.gz). Preliminary analysis shows that the first two sequences are the most abundant (4599 and 1415 reads), closely aligning with Saccharomyces cerevisiae and Cryptococcus neoformans upon a quick BLAST search. The remaining, less abundant sequences resemble Saccharomyces, showing some matches in the UNITE database. If you have sequenced other samples in this run, it is possible that these sequences are the result of tag-jumps. To confirm this, we would require data from the other samples in the sequencing run.
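
To take a quick look at that table from the command line, something like the following could be used. This is a minimal sketch: `peek_table` is a hypothetical helper, and the exact column layout of the table is not described in this thread, so the snippet only prints the first lines as they are.

```shell
# Path taken from the reply above
otu_table="Step2_Results/05.LULU/OTU_table_LULU.txt.gz"

# peek_table: hypothetical helper.
# Prints the first N lines (default 5) of a gzipped text table.
peek_table() {
  gzip -dc "$1" | head -n "${2:-5}"
}

# Only attempt to read the table if the pipeline has produced it
if [ -f "$otu_table" ]; then
  peek_table "$otu_table" 7
fi
```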

Do let me know if you need further clarification or assistance. I appreciate your understanding as I work to improve the pipeline.

PS. If you prefer, you may also use Docker instead of Singularity (just specify -profile docker).

splaisan commented 1 year ago

Thanks a lot Vladimir, it worked like a charm with your script and I got my data analysed in 8min (1) too. I now need to get the results in R and produce some plots.

Surprisingly, the mouse feces samples sequenced in parallel did not give any eukaryote OTUs despite the publications stating that mice do have them in their feces.

I wonder why the Zymo positive control PCR shows the expected S. cerevisiae while the PCR on mouse gDNA samples only gives artefactual bacterial amplicons (BLAST results; longer reads in the range of 1-3 kb) and no fungi or yeasts. The counterpart 16S V1V9 PCR worked fine on the same DNAs.

Anyway, your software does the job and confirms my manual blast results (or absence thereof), thank you for that great tool.

Cheers, Stephane

splaisan commented 1 year ago

Hi Vladimir, sorry to reopen this, I modified your second command to get result files copied instead of linked (more practical for data delivery to the end-user) but I seem to still get symlinks.

Can you please check the command below?

## Step-2 [Clustering]
nextflow run vmikk/NextITS -r main \
  -main-script Step2_AggregateRuns.nf \
  -resume -profile singularity \
  --data_path "$(pwd)/Step1_Results" \
  --outdir     "Step2_Results" \
  -work-dir    "Step2_wd" \
  --clustering_method "vsearch"

edited to

## Step-2 [Clustering]
nextflow run vmikk/NextITS -r main \
  -main-script Step2_AggregateRuns.nf \
  -resume \
  -profile singularity \
  --data_path "$(pwd)/Step1_Results_HiFi_reads" \
  --clustering_method "vsearch" \
  --storagemode "copy" \
  --outdir     "Step2_Results_HiFi_reads" \
  -work-dir    "Step2_wd_HiFi_reads"

but producing:

Step2_Results_HiFi_reads
├── [4.0K]  01.Dereplicated
│   ├── [  94]  Dereplicated.fa.gz -> /opt/biotools/NextITS/Step2_wd_HiFi_reads/30/5b667a7c272bcebd96a745ce8ef3ef/Dereplicated.fa.gz
│   └── [  94]  Dereplicated.uc.gz -> /opt/biotools/NextITS/Step2_wd_HiFi_reads/30/5b667a7c272bcebd96a745ce8ef3ef/Dereplicated.uc.gz
├── [4.0K]  03.Clustered_VSEARCH
│   ├── [  91]  Clustered.fa.gz -> /opt/biotools/NextITS/Step2_wd_HiFi_reads/15/4e9d2e82c077f96ebca009552959ab/Clustered.fa.gz
│   └── [  91]  Clustered.uc.gz -> /opt/biotools/NextITS/Step2_wd_HiFi_reads/15/4e9d2e82c077f96ebca009552959ab/Clustered.uc.gz
├── [4.0K]  04.PooledResults
│   ├── [  86]  OTUs.fa.gz -> /opt/biotools/NextITS/Step2_wd_HiFi_reads/d7/7fd93034a612446fb60cf236dd2e44/OTUs.fa.gz
│   ├── [  96]  OTU_table_long.RData -> /opt/biotools/NextITS/Step2_wd_HiFi_reads/d7/7fd93034a612446fb60cf236dd2e44/OTU_table_long.RData
│   ├── [  97]  OTU_table_long.txt.gz -> /opt/biotools/NextITS/Step2_wd_HiFi_reads/d7/7fd93034a612446fb60cf236dd2e44/OTU_table_long.txt.gz
│   ├── [  96]  OTU_table_wide.RData -> /opt/biotools/NextITS/Step2_wd_HiFi_reads/d7/7fd93034a612446fb60cf236dd2e44/OTU_table_wide.RData
│   └── [  97]  OTU_table_wide.txt.gz -> /opt/biotools/NextITS/Step2_wd_HiFi_reads/d7/7fd93034a612446fb60cf236dd2e44/OTU_table_wide.txt.gz
└── [4.0K]  05.LULU
    ├── [  98]  LULU_match_list.txt.gz -> /opt/biotools/NextITS/Step2_wd_HiFi_reads/ef/d11852ef81cf73dcb83b7b6a1be934/LULU_match_list.txt.gz
    ├── [ 106]  LULU_merging_statistics.txt.gz -> /opt/biotools/NextITS/Step2_wd_HiFi_reads/ef/d11852ef81cf73dcb83b7b6a1be934/LULU_merging_statistics.txt.gz
    ├── [  91]  OTUs_LULU.fa.gz -> /opt/biotools/NextITS/Step2_wd_HiFi_reads/ef/d11852ef81cf73dcb83b7b6a1be934/OTUs_LULU.fa.gz
    └── [  97]  OTU_table_LULU.txt.gz -> /opt/biotools/NextITS/Step2_wd_HiFi_reads/ef/d11852ef81cf73dcb83b7b6a1be934/OTU_table_LULU.txt.gz

vmikk commented 1 year ago

Hello Stephane,

Unfortunately, the storagemode parameter has not yet been implemented in the Step-2 script. I understand the convenience it would offer, and I plan to add it in the near future.

In the meantime, I must apologize for any inconvenience this may have caused you. While the current usage of symlinks helps in conserving drive space, I agree with you that having plain copies can often be more straightforward to handle.

As a temporary workaround, you can create a plain copy by following the symlinks using the cp command. E.g.:

cp -r -L Step2_Results_HiFi_reads Step2_Results_HiFi_reads_Copy

Also, if you are copying data from the remote host, you may use rsync -L ....
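
To confirm that such a copy really contains plain files, the number of symlinks left in it can be counted. The sketch below wraps the cp command in a hypothetical helper, `copy_resolved`, which is only an illustration and not part of the pipeline.

```shell
# copy_resolved: hypothetical helper wrapping "cp -r -L".
# Copies SRC to DEST while following symlinks, prints the number of
# symlinks remaining in the copy, and fails if any are left.
# DEST should not exist yet; otherwise cp places SRC inside DEST.
copy_resolved() {
  src=$1
  dest=$2
  cp -r -L "$src" "$dest" || return 1
  n_links=$(find "$dest" -type l | wc -l | tr -d ' ')
  echo "$n_links"
  [ "$n_links" -eq 0 ]
}

# Example with the directory names used above:
# copy_resolved Step2_Results_HiFi_reads Step2_Results_HiFi_reads_Copy
```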

With kind regards, Vladimir

vmikk commented 1 year ago

PS. You may also modify the Nextflow script:

sed -i 's/symlink/copy/g' ~/.nextflow/assets/vmikk/NextITS/Step2_AggregateRuns.nf
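
A slightly more cautious variant of the same substitution keeps a backup and reports what changed. This is a sketch under assumptions: `patch_publish_mode` is a hypothetical helper, and because the edit lives in Nextflow's local copy of the pipeline, it may need to be reverted or redone when the pipeline is updated.

```shell
# patch_publish_mode: hypothetical helper.
# Applies the same substitution as the sed command above, but keeps
# a .bak copy of the script and reports how many lines mention
# "symlink" before and after the edit.
patch_publish_mode() {
  script=$1
  before=$(grep -c 'symlink' "$script" || true)
  sed -i.bak 's/symlink/copy/g' "$script"
  after=$(grep -c 'symlink' "$script" || true)
  echo "lines with 'symlink': before=$before after=$after (backup: $script.bak)"
}

# Example with the path used above:
# patch_publish_mode ~/.nextflow/assets/vmikk/NextITS/Step2_AggregateRuns.nf
```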

splaisan commented 1 year ago

Thanks, I found the rsync -avL solution while you were answering, easy and perfect for my use. I will consider placing the command in the Step2 file, although I prefer to leave your code untouched if possible. I also noted that the final results refer to the OTUs with cryptic hex labels. Is blasting them against NT the best way to get the taxonomy classification (7 levels), or do you have a better alternative? Thanks again! Stephane

vmikk commented 1 year ago

For taxonomy annotation, we usually BLAST sequences against the UNITE database; for example, you may use the General FASTA release.

For a more refined approach for ITS, we are currently developing a method for species hypothesis (SH) matching. This approach aims to produce more ecologically meaningful OTUs and will incorporate taxonomy annotations with SH DOIs. https://biss.pensoft.net/article/93856/

splaisan commented 1 year ago

Thanks, I tried blasting against UNITE and it worked like a charm (and much faster than NT, obviously); now it is just a matter of parsing the output. Great news about your future improved method, looking forward to seeing it implemented in your pipeline. Now I close this issue for good ;-) Have a great evening
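
For the parsing step mentioned here, one possible approach is to keep only the best hit per OTU. The sketch below assumes the BLAST results were written in tabular format (-outfmt 6 with its default 12 columns, where column 1 is the query, column 2 the subject, column 3 the percent identity and column 12 the bit score). `best_hits` is a hypothetical helper, and the thread does not say which output format was actually used.

```shell
# best_hits: hypothetical helper.
# Reads BLAST tabular output (-outfmt 6, default columns) and keeps,
# for each query (column 1), the hit with the highest bit score
# (column 12). Prints query, subject, percent identity, bit score.
best_hits() {
  sort -t "$(printf '\t')" -k1,1 -k12,12nr "$1" |
    awk -F '\t' -v OFS='\t' '!seen[$1]++ { print $1, $2, $3, $12 }'
}

# Example (hits.tsv is a placeholder file name):
# best_hits hits.tsv > best_hits.tsv
```

If the subject IDs in the chosen UNITE release carry a taxonomy string, the second column of the result can then be split further to obtain the ranks.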

vmikk / NextITS

failed run with Zymo control data #3