Open slambrechts opened 10 months ago
Hello Sam, Thanks for reaching out and providing details on the issue you're facing with PipeCraft.
Could you please provide more information about how you formatted your input files?
NextITS expects input files to be arranged in a specific structure within the Input directory.
Each sequencing run should have its own sub-directory containing all the sample files related to that run.
Here is an example of how your directory should be structured:
Input
├── Run1
│ ├── Run1__Sample1.fq.gz
│ ├── Run1__Sample2.fq.gz
│ └── Run1__Sample3.fq.gz
├── Run2
│ ├── Run2__Sample4.fq.gz
│ ├── Run2__Sample5.fq.gz
│ └── Run2__Sample6.fq.gz
└── Run3
  ├── Run3__Sample7.fq.gz
  ├── Run3__Sample8.fq.gz
  └── Run3__Sample9.fq.gz
If you have multiple runs, it is important to use a consistent naming convention for the samples, like RunID__SampleID
(note the double underscore between the RunID and the SampleID).
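As a quick sanity check of the layout described above, a small script along these lines could be run before starting the pipeline. It is only a sketch: the function name `check_input_layout` is hypothetical, the `.fastq.gz` extension is an assumption (only `.fq.gz`, `.fq` and `.fastq` appear in this thread), and the rule that the RunID prefix must equal the sub-directory name is inferred from the example tree rather than confirmed by NextITS.

```python
import re
from pathlib import Path

# Extensions seen in this thread, plus .fastq.gz (an assumption).
FASTQ_SUFFIXES = (".fq.gz", ".fastq.gz", ".fq", ".fastq")
# RunID__SampleID: everything before the first double underscore is the run ID.
NAME_PATTERN = re.compile(r"^(?P<run>.+?)__(?P<sample>.+)$")


def check_input_layout(input_dir):
    """Return a list of human-readable problems found under input_dir."""
    root = Path(input_dir)
    if not root.is_dir():
        return [f"{root} is not a directory"]
    problems = []
    for entry in sorted(root.iterdir()):
        if entry.is_file():
            problems.append(f"{entry.name}: file is not inside a run sub-directory")
            continue
        for fq in sorted(entry.iterdir()):
            suffix = next((s for s in FASTQ_SUFFIXES if fq.name.endswith(s)), None)
            if suffix is None:
                problems.append(f"{entry.name}/{fq.name}: unexpected extension")
                continue
            match = NAME_PATTERN.match(fq.name[: -len(suffix)])
            if match is None:
                problems.append(f"{entry.name}/{fq.name}: name is not RunID__SampleID")
            elif match.group("run") != entry.name:
                # Assumption: the RunID prefix mirrors the directory name.
                problems.append(f"{entry.name}/{fq.name}: RunID differs from directory name")
    return problems


if __name__ == "__main__":
    for problem in check_input_layout("Input"):
        print(problem)
```

An empty result means every file sits in a run sub-directory and follows the double-underscore convention; single underscores inside the sample ID are accepted.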
For the chimera removal step, please make sure to use the database in UDB format. The UNITE database in UDB format can be downloaded from here: UNITE database UDB format.
While running the pipeline, you may encounter some WARN warnings - these can be safely ignored (I need to update the config to hide them).
We acknowledge that the documentation for the NextITS part is lacking at this point, and we are working on updating it to provide clearer guidance. We appreciate your patience and hope to resolve any confusion as soon as possible.
If you have any further questions or require additional assistance, please don't hesitate to let us know. With kind regards, Vlad
Hi Vlad,
Thank you for the detailed instructions. Do the files need to be zipped? I have just one run, and input files look like this:
Sample1.fastq Sample2.fastq Sample3.fastq
Kind regards, Sam
You may use uncompressed files as well (fastq or fq extensions), but put the files into a sub-directory.
Directory structure is important, as we perform tag-jump removal, and it should be done for each sequencing run independently.
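For a single run with flat files such as Sample1.fastq, the restructuring could be scripted roughly as follows. This is a sketch under assumptions: `stage_run` is a hypothetical helper, the run name "Run1" is arbitrary, and files are copied (not moved) so the originals stay untouched.

```python
import shutil
from pathlib import Path

# Extensions seen in this thread, plus .fastq.gz (an assumption).
FASTQ_SUFFIXES = (".fq.gz", ".fastq.gz", ".fq", ".fastq")


def stage_run(src_dir, work_dir, run_id):
    """Copy FASTQ files from src_dir into work_dir/Input/run_id/run_id__<name>."""
    run_dir = Path(work_dir) / "Input" / run_id
    run_dir.mkdir(parents=True, exist_ok=True)
    prefix = f"{run_id}__"
    staged = []
    for src in sorted(Path(src_dir).iterdir()):
        if not src.is_file() or not src.name.endswith(FASTQ_SUFFIXES):
            continue  # skip directories and non-FASTQ files
        # Do not add the prefix twice if the file is already named correctly.
        name = src.name if src.name.startswith(prefix) else prefix + src.name
        dest = run_dir / name
        shutil.copy2(src, dest)
        staged.append(dest)
    return staged


if __name__ == "__main__":
    for path in stage_run("raw_fastq", "my_workdir", "Run1"):
        print(path)
```

The working directory selected in PipeCraft would then be the directory that contains Input (here the hypothetical `my_workdir`).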
Ok great, good to know.
If I understand the new features correctly, I should not include the unkown.fastq 'sample', which contains reads whose sample of origin is unknown (after running demultiplexing with 1 mismatch allowed for the barcodes). Otherwise, the rescue chimeras feature could rescue chimeras that occur both in a 'real' sample and in these unknown reads, which might just be reads from the same sample.
I restructured my input directory and renamed my input files like so:
Input
├── Run1
│ ├── Run1__Sample1.fastq
│ ├── Run1__Sample2.fastq
│ └── Run1__Sample3.fastq
but I still get the same error message. I tried both with zipped and unzipped fastq files. Any idea what else might be causing this? My sample names do contain single underscores, but I guess this should be fine and is why you use double underscores to link Run and Sample ID?
rescue chimeras
That's an interesting question. You are right about the chimera rescue step. On the other hand, there is the concern about tag-jumping, where sequences might be misattributed between samples. This can be especially problematic if the 'unknown' sample contains sequences with high abundance. By including the 'unknown' sample, you give the algorithm the maximum amount of information to correctly identify tag-jumps.
So I would rather keep the 'unknown' sample and let it pass all pipeline steps. At the end, you may just remove it from the output table.
but I still get the same error message.
Which working directory do you specify in PipeCraft? (It should be the directory one level above Input.)
If you have Nextflow__*.log and Step1.log files in the working directory, could you please upload them somewhere?
I missed that I actually need to create a directory called Input, so that solves it.
I included the 'unknown' sample, will report back once I have the results
Is the tag-jumping also a concern if you only used 1 PCR step?
We have a large primer-set where multiplexing indices/barcodes are already preattached to the primers (we order them with the indices/barcodes already in the primer sequences), so per sample we use a slightly different version of the primer (different index/barcode). After PCR we do a PCR-free library preparation step. I'm not entirely sure but this might reduce the chances of tag-jumping?
Hi, I have the same issue: the pipeline gives an error at the second step:
ERROR ~ No files match pattern **/07_SeqTable/Seqs.RData
at path: /input/Step1_Results/
I have a folder my_dir_ITS/ that I selected as the work folder. Inside it is the Input/ folder, which contains the Run1/ folder with the demultiplexed PacBio ITS (single-end) amplicons (*.fq). I also can't find 07_SeqTable/ in Step1_Results/. In fact, even though the log file says that Step 1 was successful, Step1_Results/ contains no folders other than 02_primer/. According to the log file, it also seems that no further steps ran after the primer check.
Many thanks in advance!
Hi! It seems that no reads were passing the primer check, i.e., none of your input sequences contained the specified primer strings. Please double-check your specified primers in the "STEP_1" and try re-running the pipe.
Hi, Anslan, thanks for your reply. I double-checked my primers against my sequences, and they match. My primers are:
ITS1F: 5'-CTTGGTCATTTAGAGGAAGTAA-3'
ITS4: 5'-TCCTCCGCTTATTGATATGC-3'
I put CTTGGTCATTTAGAGGAAGTAA as primer forward and TCCTCCGCTTATTGATATGC as primer reverse. My fasta looks like: @m64212_220714_091748/48955643/ccs CTTGGTCATTTAGAGGAAGTAAAAGTCGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTACAGTATTCTTTTTGCCAGCGCTTAATTGCTCGGCGAAAAAACCTTACACACAGTGTTTTTTGTTATTACAAGAACTTTTGCTTTGGTCTGGACTAGAAATAGTTTGGGCCAGAGGTTTACTGAACTAAACTTCAATATTTATATTGAATTGTTATTTATTTAATTGTCAATTTGTTGATTAAATTCAAAAAATCTTCAAAACTTTCAACAACGGATCTCTTGGTTCTCGCATCGATGAAGAACGCAGCGAAATGCGATAAGTAATATGAATTGCAGATTTTCGTGAATCATCGAATCTTTGAACGCACATTGCGCCCTCTGGTATTCCAGAGGGCATGCCTGTTTGAGCGTCATTTCTCTCTCAAACCTTCGGGTTTGGTATTGAGTGATACTCTTAGTTGAACTAGGCGTTTGCTTGAAATGTATTGGCATGAGTGGTACTGGATAGTGCTATATGACTTTCAATGTATTAGGTTTATCCAACTCGTTGAATAGTTTAATGGTATATTTCTCGGTATTCTAGGCTCGGCCTTACAATATAACAAACAAGTTTGACCTCAAATCAGGTAGGATTACCCGCTGAACTTAAGCATATCAATAAGCGGAGGA +
@m64212_220714_091748/1246504/ccs
CTTGGTCATTTAGAGGAAGTAAAAGTCATAACAAGGTTTCTGTAGGTGAACCTGCAGAAGGATCATTAGTGAATGCTTAGGGGAAATCCCACTCTGTGGGCCCCGACCCTTCACCAATATCCACAAACACCTGTGCACCGTTGGTGGCGCGTACCTTCCCTTCGCCGGGAGGGTGCTGTCAGCTGCCAACACCTTTTTTTACACAAACACTGGAGTTCTATGAAAGTGATTGTATCTTGTCCTTTGTGACAGAATATAAAACAACTTTCGACAACGGATCTCTTGGTTCTCCCATCGATGAAGAACGCAGCGAAATGCGATAAGTAATGTGAATTGCAGAATTCAGTGAATCATCGAATCTTTGAACGCACCTTGCGCTCCTTGGTATTCCGAGGAGCATGCCTGTTTGAGTGTCATGAATTTCTCAATCCTCATGGGTTTTTTCTCATGCTTGGGATTGGATTTGGATGCCTGGCCGCGTCACAGCGGCCCATCTGAAATGGATTAGCTGGACCCCTATCACGGGTTGGTTCTACTCAACGTATTAATTTCCAATCGTTGAGGACGGCATGATGCATGCAAGAGAGGCTCTCCTCTCCCAAGTGCGCCGGCCAAACCGTGGGGTTGGTCTGCTTCTAGCCCGGCGAAGAGAGAGTGTGTATGTAATGTGCGCGCCTCTCTGACCCACTCTTTCAATCTGGCCTCAAATCAGGTAGGATTACCCGCTGAACTTAAGCATATCAATAAGCGGAGGA
+
@Linlin-Xu , could you please show a log file for one of the samples? To do this, follow these steps:
Identify the working directory for the primer checking step:
In the Step1_WorkDirs/RunID/Nextflow__RunID.log file, find a line with the primer_check process. E.g.,
[82/679f58] Submitted process > primer_check (RunID__Sample2)
In this example, 82/679f58 points to a directory inside Step1_WorkDirs.
Inside this directory, there should be a (hidden) file called .command.log.
Please post the content of this log here.
Hi, thank you very much for your help. The .command.log file for sample 1 is as below:
`Input file: sample1.fq.gz
Forward primer: GTACACACCGCCCGTCG
Reverse primer: CCTSCSCTTANTDATATGC
Counting primers
..forward primer
..rc-forward primer
..reverse primer
..rc-reverse primer
Looking for multiple primer occurrences
..Processing forward primers
...No forward primer matches found (in both orientations)
..Processing reverse primers
Number of artefacts found: 28
..Removing artefacts
[INFO] 28 patterns loaded from file
..Extracting artefacts
[INFO] 28 patterns loaded from file
..done
..Done
Reorienting sequences
This is cutadapt 4.4 with Python 3.10.12
Command line parameters: -a GTACACACCGCCCGTCG;required;min_overlap=15...GCATATHANTAAGSGSAGG;required;min_overlap=17 --errors 2 --revcomp --rename {header} --discard-untrimmed --cores 1 --action none --output sample1_PrimerChecked.fq.gz no_multiprimers.fq.gz
Processing single-end reads on 1 core ...
Finished in 0.146 s (20.320 µs/read; 2.95 M reads/minute).
=== Summary ===
Total reads processed: 7,196
Reads with adapters: 0 (0.0%)
Reverse-complemented: 0 (0.0%)
== Read fate breakdown ==
Reads discarded as untrimmed: 7,196 (100.0%)
Reads written (passing filters): 0 (0.0%)
Total basepairs processed: 4,400,162 bp
Total written (filtered): 0 bp (0.0%)
=== Adapter 1 ===
Sequence: GTACACACCGCCCGTCG...GCATATHANTAAGSGSAGG; Type: linked; Length: 17+19; 5' trimmed: 0 times; 3' trimmed: 0 times; Reverse-complemented: 0 times
All done
Removing empty files
./sample1_PrimerChecked.fq.gz
..Done `
From the log file, I see that the primers are misspecified.
Shouldn't they be CTTGGTCATTTAGAGGAAGTAA and TCCTCCGCTTATTGATATGC, as you mentioned earlier?
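The mismatch can be checked by hand with a few lines of Python. The sketch below is independent of NextITS and cutadapt; the helper names are hypothetical. It reverse-complements IUPAC primers and counts mismatches against the start of a read, using only sequences quoted in this thread.

```python
# IUPAC nucleotide codes and their complements.
COMPLEMENT = str.maketrans("ACGTRYSWKMBDHVN", "TGCAYRSWMKVHDBN")

# Bases that each IUPAC code stands for.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def reverse_complement(seq):
    """Reverse-complement a sequence that may contain IUPAC codes."""
    return seq.upper().translate(COMPLEMENT)[::-1]


def count_mismatches(primer, window):
    """Count positions where the read base is not allowed by the primer code."""
    return sum(1 for p, base in zip(primer.upper(), window.upper())
               if base not in IUPAC[p])


if __name__ == "__main__":
    read_start = "CTTGGTCATTTAGAGGAAGTAA"  # first 22 bases of the reads posted above
    read_end = "GCATATCAATAAGCGGAGGA"      # last 20 bases of the reads posted above

    # Primers entered in PipeCraft (ITS1F / ITS4).
    print(count_mismatches("CTTGGTCATTTAGAGGAAGTAA", read_start))
    print(reverse_complement("TCCTCCGCTTATTGATATGC") == read_end)

    # Primers that actually reached cutadapt according to .command.log.
    print(count_mismatches("GTACACACCGCCCGTCG", read_start))
    print(reverse_complement("CCTSCSCTTANTDATATGC"))
```

The last line reproduces the 3' adapter shown in the cutadapt command line of the log, which suggests that the log's primers, not the ones entered in the interface, were the ones used.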
Many thanks for your help, I did put CTTGGTCATTTAGAGGAAGTAA as primer forward and TCCTCCGCTTATTGATATGC as primer reverse in the pipecraft2 STEP_1. So, I'm confused.
I just also checked the pipecraft2_last_run_configuration.json file. The primer was still correct in it.
[[{"tooltip":"Settings for STEP_1 (sequence filtering processes per sequencing run) in NextITS pipeline","scriptName":"","imageName":"vmikk/nextits:0.5.0","serviceName":"Step_1","manualLink":"https://next-its.github.io/parameters/#step-1","disabled":"never","selected":"always","showExtra":false,"extraInputs":[{"name":"qc_maxee","value":1,"disabled":"never","tooltip":"Maximum number of expected errors","type":"numeric","rules":["_NuFrRa_e=>e>=1||\"ERROR: specify values >= 1\""]},{"name":"qc_maxhomopolymerlen","value":25,"disabled":"never","tooltip":"Threshold for a homopolymer region length in a sequence (default, 25)","type":"numeric","rules":["_NuFrRa_e=>e>=1||\"ERROR: specify values >= 1\""]},{"name":"qc_maxn","value":4,"disabled":"never","tooltip":"Discard sequences with more than the specified number of N’s","type":"numeric","rules":["_NuFrRa_e=>e>=1||\"ERROR: specify values >= 1\""]},{"name":"ITSx_evalue","value":"1e-2","disabled":"never","tooltip":"ITSx E-value cutoff threshold (default, 1e-1)","type":"numeric","rules":["_NuFrRa_e=>e>=0||\"ERROR: specify only values > 0\""]},{"name":"ITSx_partial","value":0,"disabled":"never","tooltip":"Keep partial ITS sequences (defalt, off), otherwise specify min length cutoff","type":"numeric","rules":["_NuFrRa_e=>e>=1||\"ERROR: specify values >= 1\""]},{"name":"chimera_database","active":false,"btnName":"select file","value":"undefined","disabled":"never","tooltip":"Database for reference-based chimera removal","type":"boolfile"},{"name":"ITSx_tax","items":["all","alveolata","bryophyta","bacillariophyta","amoebozoa","euglenozoa","fungi","chlorophyta","rhodophyta","phaeophyceae","marchantiophyta","metazoa","oomycota","haptophyceae","raphidophyceae","rhizaria","synurophyceae","tracheophyta","eustigmatophyceae","apusozoa","parabasalia"],"value":["all"],"disabled":"never","tooltip":"ITSx taxonomy profile (default, 'all')","type":"combobox"},{"name":"chimera_rescueoccurrence","value":2,"disabled":"never","tooltip":"Min 
occurrence of chimeric sequences required to rescue them (default, 2)","type":"numeric","rules":["_NuFrRa_e=>e>=1||\"ERROR: specify values >= 1\""]},{"name":"tj_f ","value":0.01,"disabled":"never","tooltip":"Tag-jump filtering, UNCROSS parameter
f(default, 0.01)","max":1,"min":0,"step":0.01,"type":"slide"},{"name":"tj_p","value":1,"disabled":"never","tooltip":"Tag-jump filtering parameter
p(default, 1)","type":"numeric","rules":["_NuFrRa_e=>e>=1||\"ERROR: specify values >= 1\""]},{"name":"hp","value":true,"disabled":"never","tooltip":"Homopolymer compression (default, true)","type":"bool"}],"Inputs":[{"name":"primer_forward","value":["CTTGGTCATTTAGAGGAAGTAA"],"disabled":"never","tooltip":"specify forward primer","type":"chip","iupac":true,"rules":["_NuFrRa_e=>e.length<=1||\"TOO MANY PRIMERS\""]},{"name":"primer_reverse","value":["TCCTCCGCTTATTGATATGC"],"disabled":"never","tooltip":"specify reverse primer","type":"chip","iupac":true,"rules":["_NuFrRa_e=>e.length<=1||\"TOO MANY PRIMERS\""]},{"name":"primer_mismatches","value":2,"disabled":"never","tooltip":"Maximum number of mismatches when searching for primers","type":"numeric","rules":["_NuFrRa_e=>e>=1||\"ERROR: specify values >= 1\""]},{"name":"its_region","items":["full","ITS1","ITS2"],"value":"full","disabled":"never","tooltip":"sub-regions of the internal transcribed spacer","type":"select"}]},{"tooltip":"Settings for STEP_2 (clustering) in NextITS pipeline","scriptName":"","imageName":"vmikk/nextits:0.5.0","serviceName":"Step_2","manualLink":"https://next-its.github.io/parameters/#step-2","disabled":"never","selected":"always","showExtra":false,"extraInputs":[{"name":"otu_iddef","value":2,"disabled":"never","tooltip":"Sequence similarity definition for tag-jump removal step (default, 2)","type":"numeric","rules":["_NuFrRa_e=>e>=1||\"ERROR: specify values >= 1\""]},{"name":"otu_qmask","items":["dust","none"],"value":"dust","disabled":"never","tooltip":"mask regions in sequences using the \"dust\" method, or do not mask (\"none\").","type":"select"},{"name":"swarm_fastidious","value":true,"disabled":"never","tooltip":"Link nearby low-abundance swarms (fastidious option)","type":"bool","depends_on":"state.NextITS[1].Inputs[0].value == \"swarm\" && state.NextITS[1].Inputs[2].value <= 1"},{"name":"unoise_alpha","value":2,"disabled":"never","tooltip":"Alpha parameter of 
UNOISE","type":"numeric","rules":["_NuFrRa_e=>e>=1||\"ERROR: specify values >= 1\""],"depends_on":"state.NextITS[1].Inputs[4].value == true"},{"name":"unoise_minsize","value":8,"disabled":"never","tooltip":"Minimum sequence abundance ","type":"numeric","rules":["_NuFrRa_e=>e>=1||\"ERROR: specify values >= 1\""],"depends_on":"state.NextITS[1].Inputs[4].value == true"},{"name":"max_MEEP","value":0.5,"disabled":"never","tooltip":"Maximum allowed number of expected errors per 100 bp","type":"numeric","rules":["_NuFrRa_e=>e>=0||\"ERROR: specify values >= 1\""]},{"name":"max_ChimeraScore","value":0.5,"disabled":"never","tooltip":"Maximum allowed de novo chimera score","type":"numeric","rules":["_NuFrRa_e=>e>=0||\"ERROR: specify values >= 1\""]},{"name":"lulu_match","value":95,"disabled":"never","tooltip":"Minimum similarity threshold","type":"numeric","rules":["_NuFrRa_e=>e>=0||\"ERROR: specify values >= 1\""],"depends_on":"state.NextITS[1].Inputs[3].value == true"},{"name":"lulu_ratio","value":1,"disabled":"never","tooltip":"Minimum abundance ratio","type":"numeric","rules":["_NuFrRa_e=>e>=0||\"ERROR: specify values >= 1\""],"depends_on":"state.NextITS[1].Inputs[3].value == true"},{"name":"lulu_ratiotype","items":["min","avg"],"value":"min","disabled":"never","tooltip":"Abundance ratio type - 'min' or 'avg'\t","type":"select","depends_on":"state.NextITS[1].Inputs[3].value == true"},{"name":"lulu_relcooc","value":0.95,"disabled":"never","tooltip":"Relative co-occurrence","type":"numeric","rules":["_NuFrRa_e=>e>=0||\"ERROR: specify values >= 1\""],"depends_on":"state.NextITS[1].Inputs[3].value == true"},{"name":"lulu_maxhits","value":0,"disabled":"never","tooltip":"Maximum number of hits (0 = unlimited)","type":"numeric","rules":["_NuFrRa_e=>e>=0||\"ERROR: specify values >= 1\""],"depends_on":"state.NextITS[1].Inputs[3].value == true"}],"Inputs":[{"name":"clustering_method","items":["vsearch","swarm","unoise"],"value":"vsearch","disabled":"never","tooltip":"Sequence 
clustering method","type":"select"},{"name":"otu_id","value":0.98,"disabled":"never","tooltip":"Sequence similarity for OTU clustering (default, 0.98)","max":1,"min":0,"step":0.01,"type":"slide"},{"name":"swarm_d","value":1,"disabled":"never","tooltip":"SWARM clustering resolution (d)","type":"numeric","rules":["_NuFrRa_e=>e>=0||\"ERROR: specify values >= 1\""],"depends_on":"state.NextITS[1].Inputs[0].value == \"swarm\""},{"name":"lulu","value":true,"disabled":"never","tooltip":"Run post-clustering curation with LULU","type":"bool"},{"name":"unoise","value":false,"disabled":"never","tooltip":"Perform denoising with UNOISE algorithm","type":"bool"}]}],"NextITS"]
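To read the primer settings out of a configuration like the one pasted above without scanning it by eye, something like the following could be used. It is a sketch: the helper names are hypothetical, and the only assumption about the file is what the paste shows, namely nested lists and objects in which each setting is an object with "name" and "value" keys.

```python
import json


def iter_settings(node):
    """Recursively yield every object that has both 'name' and 'value' keys."""
    if isinstance(node, dict):
        if "name" in node and "value" in node:
            yield node
        for child in node.values():
            yield from iter_settings(child)
    elif isinstance(node, list):
        for child in node:
            yield from iter_settings(child)


def get_primers(config_text):
    """Return the primer_forward / primer_reverse values found in the config."""
    wanted = ("primer_forward", "primer_reverse")
    config = json.loads(config_text)
    return {s["name"]: s["value"] for s in iter_settings(config) if s["name"] in wanted}


if __name__ == "__main__":
    with open("pipecraft2_last_run_configuration.json", encoding="utf-8") as handle:
        print(get_primers(handle.read()))
```

Note that this only shows what the interface saved. As the .command.log above illustrates, the values that reach the workflow can still differ, so the task log remains the authoritative place to check.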
Thanks for reporting this issue - we need to investigate it more thoroughly. Meanwhile, you may try to run NextITS as a standalone tool.
Hi @Linlin-Xu, we found that the primer parameters are passed correctly to NextITS. Please make sure you are specifying only a single primer pair for NextITS (i.e., removing the default primer strings in STEP_1).
Thank you, I did only put in a single primer in Pipecraft v1.0.0. When I try NextITS alone, Step1 stops after passing the disambiguate step:
[8d/e125d8] disambiguate [100%] 1 of 1, cached: 1 ✔
[- ] qc_se -
[- ] primer_check -
[- ] itsx -
[- ] seq_qual -
[- ] homopolymer -
[- ] chimera_ref -
[- ] chimera_denovo -
[- ] chimera_rescue -
[- ] chimera_denovo_agg -
[- ] glob_derep -
[- ] pool_seqs -
[- ] otu_clust -
[- ] otu_tab -
[- ] tj -
[- ] prep_seqtab -
[- ] read_counts -
Pipeline completed at : 2024-07-30T13:18:37.542562+02:00
Duration : 3.6s
Execution status : All done!
@Linlin-Xu, could you please check the log file (.command.log) for the disambiguate step? It should be inside the working directory, in the 8d/e125d8... sub-directory.
Thank you very much for your reply, it works with docker!
Thanks. I think that input data (fastq) could be misspecified. Could you please verify that the path to the input data is correct and show the full command you used to run NextITS?
Hi, thanks for your answer. I found that I had passed all the FASTQ files as input, whereas I should have passed only the directory.
I'm now at the last step of Step1, but I get an error: Error in library(package = pkg, character.only = TRUE) : there is no package called ‘metagMisc’
I installed the R package in all my local R installations, but it still doesn't work.
Please try to run NextITS with Docker or Singularity container engines enabled (add -profile singularity or -profile docker to the command).
All dependencies are available in the containers.
Hi, thanks for your reply. It works with docker.
Thank you @slambrechts and @Linlin-Xu for shedding light on this issue. Further investigation revealed a coding error in the Linux release that caused the primers to be passed incorrectly to the NextITS workflow. The bug has been fixed and a new Linux installer has been uploaded for the 1.0.0 release (https://github.com/pipecraft2/pipecraft/releases/download/v1.0.0/pipecraft_1.0.0.deb).
We were unable to reproduce this error on the Windows platform and thus we believe NextITS is working as intended on Windows.
Hi,
When trying the Pipecraft v.1.0.0 NextITS workflow I get the following error messages:
It seems like Pipecraft does not recognize the processes that are defined in the full Pipecraft NextITS workflow; it runs as if no processes or tools had been selected.
In any case, I selected my work directory containing demultiplexed PacBio ITS (single-end) amplicons (which were demultiplexed using the previous version of pipecraft2, v.0.1.4) using the SELECT WORKDIR button, after which I selected the NextITS pipeline using the SELECT PIPELINE button in the top left corner. I only edited the reverse primer because it was missing the first two bases, and selected the UNITE_9_1_beta.fasta file as the chimera database for reference-based chimera filtering. Other than that, I did not change any of the default parameters.