hariszaf / pema

PEMA: a flexible Pipeline for Environmental DNA Metabarcoding Analysis of the 16S/18S rRNA, ITS and COI marker genes
27 stars 12 forks source link

pandaseq error "Something is wrong with this ID" #67

Closed stas-malavin closed 8 months ago

stas-malavin commented 8 months ago

Hi, I'm getting an error from pandaseq when using BGI reads. Before that, I ran a small dataset from the same BGI run (different barcode) successfully. I'm using singularity v. 3.8.6 and pema_v.2.1.4.sif. How can I run pandaseq-checkid "V350194505L1C001R00100000782/1 BH:ok" using a container? Here's the output:

ERR BADID   V350194505L1C001R00100000782:::1:0:0:0: V350194505L1C001R00100000782/1 BH:ok
* * * * * Something is wrong with this ID. If tags are absent, try passing the -B option.
* * * * * Consult `pandaseq-checkid "V350194505L1C001R00100000782/1 BH:ok"` to get an idea of the problem..
Task failed:
    Program & line     : '/home/modules/preprocess.bds', line 334
    Task Name          : ''
    Task ID            : 'pema_latest.bds.20240326_090348_413/task.preprocess.line_334.id_15'
    Task PID           : '1399067'
    Task hint          : '/home/tools/PANDAseq/bin/pandaseq -f filtered_max_ERR0000001_1.fastq.gz.1P.fastq.00.0_0.cor.fastq.gz -r filtered_max_ERR0000001_2.fastq.gz.2P.fastq.00'
    Task resources     : 'cpus: 1   mem: -1.0 B timeout: 86400  wall-timeout: 86400'
    State              : 'ERROR'
    Dependency state   : 'ERROR'
    Retries available  : '1'
    Input files        : '[]'
    Output files       : '[]'
    Script file        : '/home/bioinf/pema_latest.bds.20240326_090348_413/task.preprocess.line_334.id_15.sh'
    Exit status        : '1'
    StdErr (10 lines)  :
        0x1677340:1 STAT    READS   0
        0x1677340:1 STAT    NOALGN  0
        0x1677340:1 STAT    LOWQ    0
        0x1677340:1 STAT    BADR    0
        0x1677340:1 STAT    SLOW    0
        0x1677340:1 STAT    OK  0
        0x1677340:1 STAT    OVERLAPS    0
        ERR BADID   V350194505L1C001R00100000782:::1:0:0:0: V350194505L1C001R00100000782/1 BH:ok
        * * * * * Something is wrong with this ID. If tags are absent, try passing the -B option.
        * * * * * Consult `pandaseq-checkid "V350194505L1C001R00100000782/1 BH:ok"` to get an idea of the problem..
stas-malavin commented 8 months ago

Here are my parameters:

outputFolderName    Ayalon1_18S_pr2
EnaData Yes
sequencerPrefix V
maxInfo Yes
targetLength    100
strictness  0.8
adapters    TruSeq2-PE.fa
seedMismatches  0
palindromeClipThreshold 20
simpleClipThreshold 30
leading 20
trailing    2
minlen  100
threadsTrimmomatic  10
pandaseqAlgorithm   simple_bayesian
pandaseqThreads 10
pandaseqMinlen  
minoverlap  1
threshold   0.6
elimination 
vsearchThreads  20
vsearchId   0.97
gene    gene_18S
taxonomyAssignmentMethod    alignment
numberOfCoresForPapara  20
referenceDb pr2
taxonomyFolderName  my_taxon_assign
forwardITSPrimer    GATGAAGAACGYAGYRAA
reverseITSPrimer    CTBTTVCCKCTTCACTCG
clusteringAlgo  algo_Swarm
d   15
boundary    3
swarmThreads    20
removeSingletons    Yes
omp_num_threads 20
abskew  2
midori_version  midori_1
custom_ref_db   No
name_of_custom_db   partialCustomdb
getNCBITaxId    Yes
phyloseq    No
tree    Yes
raxmlThreads    20
parsTrees   10
bootstrapTrees  100
emptyRawDataFile    Yes
emptyCheckpoints    Yes
classifierAlgo  CREST
hariszaf commented 8 months ago

Hi @stas-malavin. Was this sample among the ones you used in your

stas-malavin commented 8 months ago

Sorry, I don't understand your question

hariszaf commented 8 months ago

Apologies. My question was whether the filtered_max_ERR0000001 sample that in your error seems to lead to that error, was also among the samples you had in your small test that performed fine.

My guess here is that maybe something's funny with the BGI format but I do no have experience with that. If you d like to you could send me the two paired read files of that sample to have a look.

How can I run pandaseq-checkid "V350194505L1C001R00100000782/1 BH:ok" using a container?

I am not sure I understand your question here, but if you would like to run pema after first initiating a container in an interactive way, you can always run

singularity exec -B <>:/mnt/analysis pema_v2.1.5.sif bash
cd /home

However, in the Singularity world you cannot edit a script, as it's a read-only environment, you could do this more straightforward in a Docker environment. However, pandaseq is already using the -B flag is that is that you were thinking.

Hope this helps and I d be happy to help more if I can

stas-malavin commented 8 months ago

Hi, filtered_max_ERR0000001 specifically comes from adjustingSequences step of the pipeline (from hammerBayes, I suppose). ERR0000001, the initial sample, was not among the samples I tested on. It wasn't a test, indeed, it was just another sample, and it went smoothly. It was only one (here in this set I have several), but it probably doesn't matter.

Here are the links to the files (forward and reverse): https://drive.google.com/file/d/185ttx3WXYqlmm7UAJZ1BO_PaKRtkVZD6/view?usp=sharing https://drive.google.com/file/d/15i3habgwlyazpZ3zs2r0NYMnSlY3bVe7/view?usp=sharing

Thanks!

hariszaf commented 8 months ago

Hi @stas-malavin

I was able to run your sample using pema v.2.1.5 that is about to be released with the fastp option.

It seems that your BGI reads (I guess they come from a non Illumina platform) come with a format that our initial pipeline cannot deal with for now.

I would suggest you go with the latest version of pema and the fastp option in the preprocess step.

If you have also Docker on your system, you can pull this by running

docker pull hariszaf/pema:v.2.1.5

if not, please let me know how I could send you the .sif file of the v.2.1.5 version.

Thanks again and thanks for bringing up the BGI inconsistency.

stas-malavin commented 8 months ago

Hi @hariszaf , I've pulled the Docker container. Could you please specify how do I run it? docker run --rm -it -v .:/mnt/analysis hariszaf/pema:v.2.1.5 from the manual just opens a shell inside the container. I also tried without -i, which stands for 'interactive', same result.

hariszaf commented 8 months ago

You may run

cd /home
./pema_latest.bds 

This is still a beta version - I am checking some last combinations but I ran your sample with fastp preprocess, Swarm and CREST and it went smooth. :rocket:

stas-malavin commented 8 months ago
root@a7faceac8cd9:~# cd /home
root@a7faceac8cd9:/home# ./pema_latest.bds 
Picked up JAVA_TOOL_OPTIONS: -XX:+UseContainerSupport
Fatal error: modules/initialize.bds, line 34, pos 16. Map 'my_case' does not have key 'preprocess'.
hariszaf commented 8 months ago

Apologies, I should have mentioned that this version will use an updated parameters file.

Attached the parameters.tsv I used ; github does not support tsv so just unzip this and get the file from there.

parameters.zip

hariszaf commented 8 months ago

One extra tip, I saw in your parameters that you ask for the NCBI id As you will have a vast number of ASVs, I would leave this part out and I would find another way to get the ncbi ids once the abundance table is ready.

stas-malavin commented 8 months ago

In an old parameters.tsv you had Swarm's d=15, while the current value is 3. Why such a big difference?.. And also, would appreciate if you share your experience. I've tried Swarm, and it always seemed a bit arbitrary to me, how to select d. Did you run any tests on that?

stas-malavin commented 8 months ago

Ah, I see, you're now removing oligotons from the occurrence 5

hariszaf commented 8 months ago

This value has a lot to do with the taxonomic group you are targeting. When using for example COI as your marker gene it's wquite common you go with high values ~12. If you are working with bacteria you d go with lower, even 1.

Did you run any tests on that?

If you would have a mock community, it would be super beneficial for you to set your params.

In the pema publication, i remember we had done some validations with different d values so maybe you could also have a look there, but the best thing is to always have a mock community to drive your parameters. I know this is hard and maybe not always an option, but to my knowledge is the best way to go :)

stas-malavin commented 8 months ago

good point, thanks

stas-malavin commented 8 months ago

how should I set the boundary parameter? It's empty in the current file

stas-malavin commented 8 months ago

Hi @hariszaf , the development version works, thank you.