mitoNGS / MToolBox

A bioinformatics pipeline to analyze mtDNA from NGS data
http://sourceforge.net/projects/mtoolbox/?source=navbar
GNU General Public License v3.0
90 stars 37 forks source link

Error in the "Assembling MT GENOMES WITH assemblemtegenome #64

Closed Confurious closed 5 years ago

Confurious commented 6 years ago

Hello, I see at least two people encountering the same issue , which seems to be a problem with the median function defined in the mtVariantCaller.py script. Can any one help? Thanks

ASSEMBLING MT GENOMES WITH ASSEMBLEMTGENOME...

WARNING: values of tail < 5 are deprecated and will be replaced with 5

[bam_sort_core] merging from 6 files... [bam_translate] RG tag "sample" on read "A00509:9:HF2LVDMXX:1:1104:15076:9972" encountered with no corresponding entry in header, tag lost. Unknown tags are only reported once per input file for each tag ID. [bam_translate] RG tag "sample" on read "A00509:9:HF2LVDMXX:1:2222:27968:7122" encountered with no corresponding entry in header, tag lost. Unknown tags are only reported once per input file for each tag ID. [bam_translate] RG tag "sample" on read "A00509:9:HF2LVDMXX:1:1156:6885:20901" encountered with no corresponding entry in header, tag lost. Unknown tags are only reported once per input file for each tag ID. [bam_translate] RG tag "sample" on read "A00509:9:HF2LVDMXX:1:1365:29279:35164" encountered with no corresponding entry in header, tag lost. Unknown tags are only reported once per input file for each tag ID. [bam_translate] RG tag "sample" on read "A00509:9:HF2LVDMXX:1:1457:21531:2988" encountered with no corresponding entry in header, tag lost. Unknown tags are only reported once per input file for each tag ID. [bam_translate] RG tag "sample" on read "A00509:9:HF2LVDMXX:1:1244:12102:15029" encountered with no corresponding entry in header, tag lost. Unknown tags are only reported once per input file for each tag ID. **mpileup] 1 samples in 1 input files

Set max per-file depth to 8000 Traceback (most recent call last): File "/Users/ming/Documents/MToolBox-master/MToolBox/assembleMTgenome.py", line 443, in mut_events = mtvcf_main_analysis(mt_table, sam_file, sample_name, tail=tail) File "/Users/ming/Documents/MToolBox-master/MToolBox/mtVariantCaller.py", line 856, in mtvcf_main_analysis qs1.append(median(qs2)) File "/Users/ming/Documents/MToolBox-master/MToolBox/mtVariantCaller.py", line 402, in median m1 = sorted(l)[(len(l)/2)+1-1] IndexError: list index out of range**
l0ka commented 5 years ago

I have the same problem and I'm trying to investigate it, since the maintainers team seems to be disappeared, both here and on the google group.

domenico-simone commented 5 years ago

Hi guys,

sorry for the late reply. The project is still active, although unfortunately we can't put so much time on it. We are working on a major refactoring of the code, in the mean time we'll patch this specific error which seems pretty annoying.

Sorry again and thanks for using MToolBox and for your understanding.

Best,

Domenico

clody23 commented 5 years ago

Hi guys,

can you please share the bam file that generates the error? That would help us debugging.

Many thanks

Best regards, Claudia

Il giorno mer 16 gen 2019 alle ore 19:04 Domenico Simone < notifications@github.com> ha scritto:

Hi guys,

sorry for the late reply. The project is still active, although unfortunately we can't put so much time on it. We are working on a major refactoring of the code, in the mean time we'll patch this specific error which seems pretty annoying.

Sorry again and thanks for using MToolBox and for your understanding.

Best,

Domenico

— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub https://github.com/mitoNGS/MToolBox/issues/64#issuecomment-454900127, or mute the thread https://github.com/notifications/unsubscribe-auth/AAh5zX_xxDBOd24Zg8OIaHLl8a2go7ILks5vD3fXgaJpZM4Vbu2Q .

-- Claudia

l0ka commented 5 years ago

Hi, thank you for your reply. I've uploaded BAM file and configuration files at https://bitbucket.org/l0ka/mtoolbox_data/src/master/

The BAM file was extracted using BAMQL (https://github.com/BoutrosLaboratory/bamql), starting from a WGS file, (but I get the same error even if I perform the extraction with MToolBox, setting MitoExtraction=true) using the following code:

bamql -I -o test.bam -f input.bam '(chr(M) & mate_chr(M)) | (chr(Y) & after(59000000) & mate_chr(M))'

Then I checked the resulting BAM with picard ValidateSamFile and everything was ok, according to this tool.

I'm using MToolBox on a Linux server (Ubuntu 16.04.3 with Linux 4.4.0-101-generic) and with the default versions of its tools (I've performed the full installation with default parameters).

EDIT: I found the problem is related to GMAP-GSNAP. I updated it to the latest version (2018-07-04, from http://research-pub.gene.com/gmap/) and now MToolBox is working as expected. However I don't know why this problem was related to only some of my BAM files.

clody23 commented 5 years ago

Hi, we just pushed a change to the mtVariantCaller.py script that should fix this.

Can you guys @Confurious @l0ka please do git pull and test MToolBox on your files again and let us know if this works?

Many thanks Claudia

l0ka commented 5 years ago

I downloaded an installed MToolBox again, in another machine. Using the same BAM file, now I get this:

##### SORTING OUT.sam FILES WITH PICARDTOOLS...

[Sat Jan 26 12:27:57 CET 2019] net.sf.picard.sam.SortSam INPUT=OUT.sam OUTPUT=OUT.sam.bam SORT_ORDER=coordinate TMP_DIR=[/test/OUT_test/tmp]    VERBOSITY=INFO QUIET=false VALIDATION_STRINGENCY=STRICT COMPRESSION_LEVEL=5 MAX_RECORDS_IN_RAM=500000 CREATE_INDEX=false CREATE_MD5_FILE=false
INFO    2019-01-26 12:27:57 SortSam Finished reading inputs, merging and writing to output now.
[Sat Jan 26 12:27:57 CET 2019] net.sf.picard.sam.SortSam done. Elapsed time: 0.01 minutes.
Runtime.totalMemory()=56623104
Success.

##### REALIGNING KNOWN INDELS WITH GATK INDELREALIGNER...

Realigning known indels for file OUT_test/OUT.sam.bam using /mtoolbox_test/MToolBox/MToolBox//data/MITOMAP_HMTDB_known_indels.chrM as reference...
##### ERROR ------------------------------------------------------------------------------------------
##### ERROR A USER ERROR has occurred (version 3.8-1-0-gf15c1c3ef): 
##### ERROR
##### ERROR This means that one or more arguments or inputs in your command are incorrect.
##### ERROR The error message below tells you what is the problem.
##### ERROR
##### ERROR If the problem is an invalid argument, please check the online documentation guide
##### ERROR (or rerun your command with --help) to view allowable command-line arguments for this tool.
##### ERROR
##### ERROR Visit our website and forum for extensive documentation and answers to 
##### ERROR commonly asked questions https://software.broadinstitute.org/gatk
##### ERROR
##### ERROR Please do NOT post this error to the GATK forum unless you have really tried to fix it yourself.
##### ERROR
##### ERROR MESSAGE: Invalid command line: Malformed walker argument: Could not find walker with name: IndelRealigner
##### ERROR ------------------------------------------------------------------------------------------

The last process reported an error. Exit.

And if I skip GATK I get:

##### SORTING OUT.sam FILES WITH PICARDTOOLS...

[Sat Jan 26 12:29:21 CET 2019] net.sf.picard.sam.SortSam INPUT=OUT.sam OUTPUT=OUT.sam.bam SORT_ORDER=coordinate TMP_DIR=[/test_noGATK/OUT_test/tmp]    VERBOSITY=INFO QUIET=false VALIDATION_STRINGENCY=STRICT COMPRESSION_LEVEL=5 MAX_RECORDS_IN_RAM=500000 CREATE_INDEX=false CREATE_MD5_FILE=false
INFO    2019-01-26 12:29:22 SortSam Finished reading inputs, merging and writing to output now.
[Sat Jan 26 12:29:22 CET 2019] net.sf.picard.sam.SortSam done. Elapsed time: 0.01 minutes.
Runtime.totalMemory()=56623104
Success.

Skip Indel Realigner...
Skipping Mark Duplicates...
[Sat Jan 26 12:29:22 CET 2019] net.sf.picard.sam.SamFormatConverter INPUT=OUT.sam.bam.marked.bam OUTPUT=OUT.sam.bam.marked.bam.marked.sam TMP_DIR=[/test_noGATK/OUT_test/tmp]    VERBOSITY=INFO QUIET=false VALIDATION_STRINGENCY=STRICT COMPRESSION_LEVEL=5 MAX_RECORDS_IN_RAM=500000 CREATE_INDEX=false CREATE_MD5_FILE=false
[Sat Jan 26 12:29:22 CET 2019] net.sf.picard.sam.SamFormatConverter done. Elapsed time: 0.01 minutes.
Runtime.totalMemory()=56623104

##### ASSEMBLING MT GENOMES WITH ASSEMBLEMTGENOME...

WARNING: values of tail < 5 are deprecated and will be replaced with 5

[mpileup] 1 samples in 1 input files
<mpileup> Set max per-file depth to 8000

##### GENERATING VCF OUTPUT...
Reference sequence used for VCF: RCRS

##### PREDICTING HAPLOGROUPS AND ANNOTATING/PRIORITIZING VARIANTS...

Haplogroup predictions based on RSRS Phylotree build 17
Your best results file is  mt_classification_best_results.csv

Loading contig sequences from file test-contigs.fasta
Unable to compute haplogroup. ExitParsing pathogenicity table...
Parsing variability data...
Parsing info about haplogroup-defining sites...
Parsing info about haplogroup assignments...
No annotation.csv found. Exit
l0ka commented 5 years ago

I obtain the same results also using your sim_data:

##### SORTING OUT.sam FILES WITH PICARDTOOLS...

[Sat Jan 26 13:16:51 CET 2019] net.sf.picard.sam.SortSam INPUT=OUT.sam OUTPUT=OUT.sam.bam SORT_ORDER=coordinate TMP_DIR=[/MToolBox/test/sim_data/OUT_simulation100X/tmp]    VERBOSITY=INFO QUIET=false VALIDATION_STRINGENCY=STRICT COMPRESSION_LEVEL=5 MAX_RECORDS_IN_RAM=500000 CREATE_INDEX=false CREATE_MD5_FILE=false
INFO    2019-01-26 13:16:51 SortSam Finished reading inputs, merging and writing to output now.
[Sat Jan 26 13:16:51 CET 2019] net.sf.picard.sam.SortSam done. Elapsed time: 0.01 minutes.
Runtime.totalMemory()=56623104
Success.

Skip Indel Realigner...
Skipping Mark Duplicates...
[Sat Jan 26 13:16:51 CET 2019] net.sf.picard.sam.SamFormatConverter INPUT=OUT.sam.bam.marked.bam OUTPUT=OUT.sam.bam.marked.bam.marked.sam TMP_DIR=[/MToolBox/test/sim_data/OUT_simulation100X/tmp]    VERBOSITY=INFO QUIET=false VALIDATION_STRINGENCY=STRICT COMPRESSION_LEVEL=5 MAX_RECORDS_IN_RAM=500000 CREATE_INDEX=false CREATE_MD5_FILE=false
[Sat Jan 26 13:16:52 CET 2019] net.sf.picard.sam.SamFormatConverter done. Elapsed time: 0.01 minutes.
Runtime.totalMemory()=56623104

##### ASSEMBLING MT GENOMES WITH ASSEMBLEMTGENOME...

WARNING: values of tail < 5 are deprecated and will be replaced with 5

[mpileup] 1 samples in 1 input files
<mpileup> Set max per-file depth to 8000

##### GENERATING VCF OUTPUT...
Reference sequence used for VCF: RCRS

##### PREDICTING HAPLOGROUPS AND ANNOTATING/PRIORITIZING VARIANTS...

Haplogroup predictions based on RSRS Phylotree build 17
Your best results file is  mt_classification_best_results.csv

Loading contig sequences from file simulation100X-contigs.fasta
Unable to compute haplogroup. ExitParsing pathogenicity table...
Parsing variability data...
Parsing info about haplogroup-defining sites...
Parsing info about haplogroup assignments...
No annotation.csv found. Exit

Configuration file is the following

mtdb_fasta=chrM.fa
hg19_fasta=hg19RCRS.fa
mtdb=chrM
humandb=hg19RCRS
input_path=/MToolBox/test/sim_data/
output_name=/MToolBox/test/sim_data
list=test_list.lst
input_type=fastq
ref=RCRS
UseMarkDuplicates=false
UseIndelRealigner=false
MitoExtraction=false
clody23 commented 5 years ago

Dear @l0ka,

the analysis on sim_data is expected to fail on haplogroup predictions, because those are simulated sequences, therefore it is likely that your MToolBox run on sim_data was actually successful.

Could you please test the latest MToolBox version - that you can get by doing a git pull in your local repo (no need to re-install) - on the MToolBox HG00119 example? You can find instructions here: https://github.com/mitoNGS/MToolBox/blob/master/test/HG00119_example/run_mtoolbox_on_test_file.md.

Moreover, I tested the latest fix with the bam file you shared and MToolBox ran smoothly...so I cannot figure out where your error is coming from. These are some of MToolBox files generated from your test.bam, with this latest MToolBox fix:

log_mtoolbox.txt logassemble.txt logmt.txt test.vcf.zip

Also, please not that you can now assign a name to your vcf file by specifying that in your configuration file with the vcf_name argument.

Best regards, Claudia

l0ka commented 5 years ago

I tested the latest MToolBox version (by doing git pull in my local repo ), on the same server I was initially experiencing the errors. Both tests (on HG00119 and on my BAM file) run smoothly.

Thank you for the support!

l0ka commented 5 years ago

I tested MToolBox on almost 700 samples without problems. Thank you again.

domenico-simone commented 5 years ago

Great to hear this!!! Thank you and stay tuned for future updates! Just drop a line for any comment/suggestion!