ocxtal / minialign

[IMPORTANT: not for real data analysis, only for algorithm evaluation] fast and accurate alignment tool for PacBio and Nanopore long reads
MIT License
126 stars 9 forks source link

Errors with Picard with the output of miniAlign #6

Closed danarte closed 7 years ago

danarte commented 7 years ago

Hello, I'm trying to use picard's CollectSummaryMetrics on the output of miniAlign and run into errors:

java -jar picard.jar CollectAlignmentSummaryMetrics R=LambdaRefGenome.fa I=pass.poretools.best.minialign.sam O=testSummary.txt
[Wed Jan 11 15:43:35 IST 2017] picard.analysis.CollectAlignmentSummaryMetrics REFERENCE_SEQUENCE=LambdaRefGenome.fa INPUT=pass.poretools.best.minialign.sam OUTPUT=testSummary.txt    MAX_INSERT_SIZE=100000 EXPECTED_PAIR_ORIENTATIONS=[FR] ADAPTER_SEQUENCE=[AATGATACGGCGACCACCGAGATCTACACTCTTTCCCTACACGACGCTCTTCCGATCT, AGATCGGAAGAGCTCGTATGCCGTCTTCTGCTTG, AATGATACGGCGACCACCGAGATCTACACTCTTTCCCTACACGACGCTCTTCCGATCT, AGATCGGAAGAGCGGTTCAGCAGGAATGCCGAGACCGATCTCGTATGCCGTCTTCTGCTTG, AATGATACGGCGACCACCGAGATCTACACTCTTTCCCTACACGACGCTCTTCCGATCT, AGATCGGAAGAGCACACGTCTGAACTCCAGTCACNNNNNNNNATCTCGTATGCCGTCTTCTGCTTG] METRIC_ACCUMULATION_LEVEL=[ALL_READS] IS_BISULFITE_SEQUENCED=false ASSUME_SORTED=true STOP_AFTER=0 VERBOSITY=INFO QUIET=false VALIDATION_STRINGENCY=STRICT COMPRESSION_LEVEL=5 MAX_RECORDS_IN_RAM=500000 CREATE_INDEX=false CREATE_MD5_FILE=false GA4GH_CLIENT_SECRETS=client_secrets.json
[Wed Jan 11 15:43:35 IST 2017] Executing as artemd@compute-0-15.local on Linux 2.6.32-642.6.2.el6.x86_64 amd64; Java HotSpot(TM) 64-Bit Server VM 1.8.0_71-b15; Picard version: 2.8.1-SNAPSHOT
[Wed Jan 11 15:43:35 IST 2017] picard.analysis.CollectAlignmentSummaryMetrics done. Elapsed time: 0.00 minutes.
Runtime.totalMemory()=2025848832
To get help, see http://broadinstitute.github.io/picard/index.html#GettingHelp
Exception in thread "main" htsjdk.samtools.SAMFormatException: Error parsing SAM header. @RG line missing SM tag. Line:
@RG     ID:1; ; Line number 3
        at htsjdk.samtools.SAMTextHeaderCodec.reportErrorParsingLine(SAMTextHeaderCodec.java:246)
        at htsjdk.samtools.SAMTextHeaderCodec.access$200(SAMTextHeaderCodec.java:43)
        at htsjdk.samtools.SAMTextHeaderCodec$ParsedHeaderLine.requireTag(SAMTextHeaderCodec.java:327)
        at htsjdk.samtools.SAMTextHeaderCodec.parseRGLine(SAMTextHeaderCodec.java:175)
        at htsjdk.samtools.SAMTextHeaderCodec.decode(SAMTextHeaderCodec.java:108)
        at htsjdk.samtools.SAMTextReader.readHeader(SAMTextReader.java:200)
        at htsjdk.samtools.SAMTextReader.<init>(SAMTextReader.java:63)
        at htsjdk.samtools.SAMTextReader.<init>(SAMTextReader.java:73)
        at htsjdk.samtools.SamReaderFactory$SamReaderFactoryImpl.open(SamReaderFactory.java:359)
        at htsjdk.samtools.SamReaderFactory$SamReaderFactoryImpl.open(SamReaderFactory.java:169)
        at picard.analysis.SinglePassSamProgram.makeItSo(SinglePassSamProgram.java:89)
        at picard.analysis.SinglePassSamProgram.doWork(SinglePassSamProgram.java:77)
        at picard.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:208)
        at picard.cmdline.PicardCommandLine.instanceMain(PicardCommandLine.java:95)
        at picard.cmdline.PicardCommandLine.main(PicardCommandLine.java:105)

The error is "Error parsing SAM header. @RG line missing SM tag", I tried searching and found that the SM tag is not mandatory and I can get rid of that error with "VALIDATION_STRINGENCY=LENIENT"

I tried that but still run into errors:

...
Ignoring SAM validation error: ERROR: Error parsing SAM header. @RG line missing SM tag. Line:
@RG     ID:1
WARNING 2017-01-11 15:48:03     SinglePassSamProgram    File reports sort order 'unsorted', assuming it's coordinate sorted anyway.
[Wed Jan 11 15:48:03 IST 2017] picard.analysis.CollectAlignmentSummaryMetrics done. Elapsed time: 0.00 minutes.
Runtime.totalMemory()=2025848832
To get help, see http://broadinstitute.github.io/picard/index.html#GettingHelp
Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException: 37
        at picard.analysis.AlignmentSummaryMetricsCollector$GroupAlignmentSummaryMetricsPerUnitMetricCollector$IndividualAlignmentSummaryMetricsCollector.collectQualityData(AlignmentSummaryMetricsCollector.java:323)
        at picard.analysis.AlignmentSummaryMetricsCollector$GroupAlignmentSummaryMetricsPerUnitMetricCollector$IndividualAlignmentSummaryMetricsCollector.addRecord(AlignmentSummaryMetricsCollector.java:189)
        at picard.analysis.AlignmentSummaryMetricsCollector$GroupAlignmentSummaryMetricsPerUnitMetricCollector.acceptRecord(AlignmentSummaryMetricsCollector.java:121)
        at picard.analysis.AlignmentSummaryMetricsCollector$GroupAlignmentSummaryMetricsPerUnitMetricCollector.acceptRecord(AlignmentSummaryMetricsCollector.java:87)
        at picard.metrics.MultiLevelCollector$AllReadsDistributor.acceptRecord(MultiLevelCollector.java:192)
        at picard.metrics.MultiLevelCollector.acceptRecord(MultiLevelCollector.java:315)
        at picard.analysis.AlignmentSummaryMetricsCollector.acceptRecord(AlignmentSummaryMetricsCollector.java:83)
        at picard.analysis.CollectAlignmentSummaryMetrics.acceptRead(CollectAlignmentSummaryMetrics.java:147)
        at picard.analysis.SinglePassSamProgram.makeItSo(SinglePassSamProgram.java:138)
        at picard.analysis.SinglePassSamProgram.doWork(SinglePassSamProgram.java:77)
        at picard.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:208)
        at picard.cmdline.PicardCommandLine.instanceMain(PicardCommandLine.java:95)
        at picard.cmdline.PicardCommandLine.main(PicardCommandLine.java:105)

(I did sort the file) So now the error is ArrayIndexOutOfBoundsException error and somehow related to collecting q-score data. How can I solve this? (I run into similar problem when a sam file had * instead of qscore data but picard doc says it should be able to handle abscence of qscore data)

Here is the start of the sam file (replaced long strings with ---):

@HD     VN:1.5  SO:coordinate
@SQ     SN:burn-in      LN:48502
@RG     ID:1
@PG     ID:6    PN:minialign
0034a196-edbc-429f-89c4-b5280a486760_Basecall_2D_2d   0   burn-in 1   60   21S21M1D---1M21S   *   0   0   ACTACG---GAAG   *  RG:Z:1
06c0ff36-09df-4bb3-b952-146fca6f60ae_Basecall_2D_2d   0   burn-in 1   60   31S21M1D---2M31S   *   0   0   TACAAC---TCGG   *  RG:Z:1
1a558c5f-40d0-4bec-9777-30c14f651acd_Basecall_2D_2d   0   burn-in 1   60   21S36M1D---6M24S   *   0   0   AACGTT---GAAA   *  RG:Z:1

(and here is a start of a bam file from bwa alignment for example, it worked fine with picard):

@HD     VN:1.5  SO:coordinate
@SQ     SN:burn-in      LN:48502
@PG     ID:bwa  PN:bwa  VN:0.7.15-r1142-dirty   CL:bwa mem -x ont2d -t 12 ../LambdaRefGenome.fa /genomicslab/nobackup/volume2/artemd/minion/third_lambda_run/reads/DiffExtractionAndAligners/./pass.poretools.best.fastq
0034a196-edbc-429f-89c4-b5280a486760_Basecall_2D_2d  0   burn-in 1   60   21S1---M21S   *   0   0   ACTAC---TGAAG   $$$%---,*,($   MD:Z:12---11G31   NM:i:138    AS:i:1920   XS:i:0
06c0ff36-09df-4bb3-b952-146fca6f60ae_Basecall_2D_2d  0   burn-in 1   60   31S1---M31S   *   0   0   TACAA---TTCGG   ""%)---+*+&#   MD:Z:18---^AA42   NM:i:371    AS:i:5836   XS:i:0
ocxtal commented 7 years ago

Hi, @danarte

Thank you for your bug report. I've briefly tested picard (master on github) and confirmed that it requires the RG line with the SM option accompanied (and also quality string present in each record). So I added a quality string sustaining option '-Q' and a RG line injection option '-R' (the same as the bwa-mem's -R option) in the 0.4.3 release. Just giving -Q to minialign (or -Q -R"@RG\tID:1\tSM:foo" or something, if you need RG tags) works[ed] well with the CollectAlignmentSummaryMetrics subcommand.

Thanks,

Hajime Suzuki

danarte commented 7 years ago

Wow, great quick answer and a solution. Just had a chance to check it and it works. Quick note: Index file needed to be rebuilt for the new minialign version, so I think it would be a good idea to add this to the changelog.

Thank you.

ocxtal commented 7 years ago

Thank you for your helpful comment! I'll fix it as soon as possible.💪

Hajime Suzuki