pirovc / ganon

ganon2 classifies genomic sequences against large sets of references efficiently, with integrated download and update of databases (refseq/genbank), taxonomic profiling (ncbi/gtdb), binning and hierarchical classification, customized reporting and more
https://pirovc.github.io/ganon/
MIT License
86 stars 13 forks source link

Output variability between runs/systems? #286

Closed jfy133 closed 1 month ago

jfy133 commented 6 months ago

We are in the process in updating nf-core/taxprofiler to include ganon 2.0

During the establishment of CI tests we found there is slight variability in the output (where ganon 1 did not).

More specifically, @sofstam identified that while ganon v2 output compositionally was OK on our local machines, when the tests ran again but on GitHub Actions, the output appeared to not add up to 100% (99.64%)

Is this to be expected with the new update?

@sofstam can provide more info if necessary :)

pirovc commented 6 months ago

Thanks for adding ganon2 to the pipeline!

Is the difference in results between machines happening with the same data? It should indeed always add up to 100%. With access to the files and commands used I can gladly check what is the issue.

jfy133 commented 6 months ago

Yes it was the same data I believe @sofstam can you confirm (and share)?

Note the compositionality issue was detected with taxpasta, so there is also that to check, but that literally just adds up the selected column

sofstam commented 6 months ago

Yes the data are the same. Tried to upload them but the file is too big. Can be found here.

ganon_GHA.zip Results by running ganon 2 through GHA ganon_local.zip Results by running ganon 2 on my local machine

pirovc commented 6 months ago

Could you also send the .tax file of the database used?

sofstam commented 6 months ago

This is the db used.

test-db-ganon.zip

pirovc commented 6 months ago

I did some investigation. Using the reads 2612_pe against the provided database using the default parameters (v2.1.0): ganon classify -d test-db-ganon -p ERX5474932_ERR5766176_1.fastq.gz ERX5474932_ERR5766176_2.fastq.gz -o 2612_pe.ganon

I get the following output (2612_pe.ganon.rep):

H1                   NC_012920.1   59  59  0  sequence  NC_012920.1
H1                   NC_001136.10  1   1   0  sequence  NC_001136.10
#total_classified    60                                 
#total_unclassified  632000 

From the files you sent me, ganon_local/db1/2612_pe.ganon.rep looks like:

H1                   NC_012920.1   118  59  0  sequence  NC_012920.1
H1                   NC_001136.10  2    1   0  sequence  NC_001136.10
#total_classified    120                                  
#total_unclassified  1264000 

The report above has inconsistencies, therefore creating the issues reported here. I have some clues for what is happening (probably repeated headers) but I need to replicate it to be able to identify it and fix it. Could you please confirm if this data is correct and tell me is the exact command that taxprofiler is running?

sofstam commented 6 months ago

Thank you for looking into that. Yes, the data is correct.

Regarding the command, I clone this branch and then I run:

nextflow run main.nf -profile hasta,dev_prio,test,singularity --run_ganon --outdir test_ganon_output

pirovc commented 6 months ago

Thanks. I actually meant what is the ganon command generated by the taxprofiler.

jfy133 commented 6 months ago

@sofstam probably the .command.shfile in the work directory should suffice

sofstam commented 6 months ago

Sorry, I misread your comment above.

Both commands:

#!/bin/bash -euo pipefail
dbprefix=$(find -L . -name '*.*ibf' | sed 's/\.h\?ibf$//')

ganon \
    classify \
    --db-prefix ${dbprefix%%.*ibf} \
     \
    --threads 2 \
    --output-prefix 2611_se_db1.ganon \
    --single-reads ERX5474930_ERR5766174_1.fa.gz         2>&1 | tee 2611_se_db1.ganon.log

cat <<-END_VERSIONS > versions.yml
"NFCORE_TAXPROFILER:TAXPROFILER:PROFILING:GANON_CLASSIFY":
    ganon: $(echo $(ganon --version 2>1) | sed 's/.*ganon //g')
#!/bin/bash -euo pipefail
dbprefix=$(find -L . -name '*.*ibf' | sed 's/\.h\?ibf$//')

ganon \
    report \
    --input 2611_se_db1.ganon.rep \
    --output-prefix 2611_se_db1.ganon_report \
    --db-prefix ${dbprefix%%.*ibf} \
    --report-type reads --ranks default --top-percentile 0 --min-count 0 --max-count 0

cat <<-END_VERSIONS > versions.yml
"NFCORE_TAXPROFILER:TAXPROFILER:PROFILING:GANON_REPORT":
    ganon: $(echo $(ganon --version 2>1) | sed 's/.*ganon //g')
END_VERSIONS
pirovc commented 6 months ago

Can you also send the command used for the specific run 2612_pe?

sofstam commented 6 months ago
#!/bin/bash -euo pipefail
dbprefix=$(find -L . -name '*.*ibf' | sed 's/\.h\?ibf$//')

ganon \
    classify \
    --db-prefix ${dbprefix%%.*ibf} \
     \
    --threads 2 \
    --output-prefix 2612_pe_ERR5766176_db1.ganon \
    --paired-reads ERX5474932_ERR5766176_1.fastq.gz ERX5474932_ERR5766176_2.fastq.gz         2>&1 | tee 2612_pe_ERR5766176_db1.ganon.log

cat <<-END_VERSIONS > versions.yml
"NFCORE_TAXPROFILER:TAXPROFILER:PROFILING:GANON_CLASSIFY":
    ganon: $(echo $(ganon --version 2>1) | sed 's/.*ganon //g')
END_VERSIONS
#!/bin/bash -euo pipefail
dbprefix=$(find -L . -name '*.*ibf' | sed 's/\.h\?ibf$//')

ganon \
    report \
    --input 2612_pe_ERR5766176_B_db1.ganon.rep \
    --output-prefix 2612_pe_ERR5766176_B_db1.ganon_report \
    --db-prefix ${dbprefix%%.*ibf} \
    --report-type reads --ranks default --top-percentile 0 --min-count 0 --max-count 0

cat <<-END_VERSIONS > versions.yml
"NFCORE_TAXPROFILER:TAXPROFILER:PROFILING:GANON_REPORT":
    ganon: $(echo $(ganon --version 2>1) | sed 's/.*ganon //g')
END_VERSIONS
pirovc commented 6 months ago

Thanks for the commands.

Yes the data are the same. Tried to upload them but the file is too big. Can be found here.

ganon_GHA.zip Results by running ganon 2 through GHA ganon_local.zip Results by running ganon 2 on my local machine

From the files you sent, 2 have problems, one in the local (2612_pe_db1.ganon.rep) and the other in the GHA (2612_se_db1.ganon.rep). This means it's not a machine specific issue. Only by looking at the .rep file after ganon classify is possible to detect errors, the sum of columns 4 and 5 should match the #total_classified value reported at the end.

For the local (2612_pe_db1.ganon.rep) I could "replicate" the outcome by providing the input files twice ganon classify --db-prefix test-db-ganon --threads 2 --output-prefix 2612_pe_db1.ganon --paired-reads ERX5474932_ERR5766176_1.fastq.gz ERX5474932_ERR5766176_2.fastq.gz ERX5474932_ERR5766176_1.fastq.gz ERX5474932_ERR5766176_2.fastq.gz. This runs fine but triggers a know bug in the reassignment process, where repeated headers are counted only once.

For the GHA (2612_se_db1.ganon.rep) I have no clue. The number of reads do not match the supposed input as well.

Before I go into any further debugging, can you double check if the fastq files used are as expected?

sofstam commented 6 months ago

Thanks for your reply.

We only updated the module and we did not change anything in taxprofiler. Is this what you mean if the fastq files are as expected?

pirovc commented 6 months ago

I mean to double check that all the inputs are correct, for example, the files are not corrupted or somehow duplicated.

If not, I would need some more information to try understand what is happening. Could you re-run the 2 problematic files I mentioned above (just ganon classify is enough) with --verbose and send me the logs?

sofstam commented 6 months ago

Sorry for my late reply.

2612_se_ERR5766180_db1.ganon.log

2612_pe_ERR5766176_db1.ganon.log

pirovc commented 6 months ago

Thanks, the logs are looking alright. Can I also have the .rep files generated from those runs? (2612_pe_ERR5766176_db1.ganon.rep and 2612_se_ERR5766180_db1.ganon.rep)

sofstam commented 6 months ago

Had to rename to .txt in order to upload them here.

2612_pe_ERR5766176_db1.ganon.txt

2612_se_ERR5766180_db1.ganon.txt

pirovc commented 6 months ago

Thanks again. I could not detect any issue in those files and reports generated with them will add up to 100%. Note that they are different from the ones you sent before.

The only way for me to try detect a possible bug would be to have the .rep and .log (with --verbose) from the files with issues, which are (based on those zips):

jfy133 commented 1 month ago

Sorry on the delay of this @pirovc , I am meant to be tasked with updating the nf-core modules and make sure this is an error at that end, but I've been on part-time parental leave since January so very very slow going.

jfy133 commented 1 month ago

will reopen if it happens again

pirovc commented 1 month ago

No worries, just keeping the house clean, feel free to re-open anytime.