ole-tange / barmaids

0 stars 0 forks source link

bamdamage: need to rewrite this script. it contains notably a bug with regard to Ns. #65

Open sapfo opened 10 years ago

sapfo commented 10 years ago

Create a dist for bamdamage.

ole-tange commented 10 years ago

Where is the newest version of bamdamage?

Do we have a man page for it - if not: Create on based on bammds.pod.

Include an example of how to run it.

sapfo commented 10 years ago

required scripts:

Archive/mdsdamage.dir/plotsinglestat.r Archive/mdsdamage.dir/quickstat.pl

how to run:

if($::opt_stat) { set_params_ALL();

outfile stat

$G::statpdf = $G::outfile || $base.".stat.pdf"; my @cmd = qq{samtools view -h $G::bam | quickstat.pl - $G::bam -ml $G::mapq -sl $G::ntq -so 33 -ll 0 -lu 100 -du 20 -pi 80 -su 50; plotsinglestat.r ${G::bam}.stat $G::samplename $G::statpdf }; ::debug(@cmd,"\n"); system(@cmd); exit($?);

}

manual:

badamage: generate the plot for the read length distribution, damage distribution, 5' damage pattern, 3' damage pattern in a pdf-file.

=item B<--mapquality> I =item B<-m> I Skip reads with mapping quality smaller than I. Default: 30.

=item B<--ntquality> I =item B<-n> I Skip bases with base quality smaller than I. Default: 20.

=item B<--samplename> I =item B<-s> I

Use I as the name for the sample in plots. Default is AncientSample.

sapfo commented 10 years ago

Pod file added where suggested.

Location of the scripts (not under Archive): plotsinglestat.r quickstat.pl I made a few changes to plotsinglestat.r after a bug report by Morten R.

Added the example that can be downloaded under: contamination/webpage

ole-tange commented 10 years ago

Implemented in [342]. Ready for testing.

sapfo commented 10 years ago

Hi Ole, I have been working on this new data they want us the include, sorry it has taken a while. The data file is small. So i put it under testinputfiles/Gus.merge.Shotgun.GRCh37.q30.Q0.flt.sort.rmdup.realign.md.bam

There is one thing i dont get: when running bamdamage Gus.merge.Shotgun.GRCh37.q30.Q0.flt.sort.rmdup.realign.md.bam

I get: bamdamage: Warning: The bam file contains unexpected MD strings. Use -v to see them

Now when using the old pipeline, i do not get such a warning (i know you modified the warning message, but in fact i get no warning at all).

I think this might be due to me being stupid and having several versions of quickstat.pl around. I added the version used under my computer with a different name: quickstat_old.pl This one should gives as output: 10000 lines processed 20000 lines processed 30000 lines processed 40000 lines processed 50000 lines processed 60000 lines processed I think.

Could you perhaps look into it? thanks.

ole-tange commented 10 years ago

quickstat_old.pl also produces a .log file. We do not do that: We simply print the log on screen.

So if you look in the .log file, you will see all the "unexpected MD string".

sapfo commented 10 years ago

ok. thanks. thats really odd then: all the bam files i tries have this warning output - so thats probably a small bug in previous quickstat. i just had not seen it.

could you help me out figure it whats expected? or should we ask Yong? seems to me in there:

while(substr($cigar_expanded, $pos, 1) ne "M") {
$pos++;
$refres=uc(substr($ref,$pos,1));
}
$refres=uc($curstr);
unless ($refres=~/[ACGT]/) {
print LOG "unexpected MD string $mismstrbk $curstr $refres\n";

exit;

                    }  
                    $readres=uc(substr($read,$pos,1));  
                    unless ($readres=~/[ACGT]/) {  
                            print LOG "unexpected MD string $mismstrbk $read $readres\n";  
                            #exit;                                                          

is it possible it is when there is an N character? the reads can contain ACGTN characters

mortenr commented 10 years ago

has these bam files been through GATK indelRealign? (and later samtools calmd) (i.e. Simon’s default mapping pipeline)??

This most likely explains why they don’t match the pattern expected, as indelrealign + samtools calmd adds N’s and potentially other characters to the cigar.

Morten

On Mar 18, 2014, at 23:30, sapfo notifications@github.com wrote:

ok. thanks. thats really odd then: all the bam files i tries have this warning output - so thats probably a small bug in previous quickstat. i just had not seen it.

could you help me out figure it whats expected? or should we ask Yong? seems to me in there:

while(substr($cigar_expanded, $pos, 1) ne "M") {

$pos++;

$refres=uc(substr($ref,$pos,1));

}

$refres=uc($curstr);

unless ($refres=~/[ACGT]/) {

print LOG "unexpected MD string $mismstrbk $curstr $refres\n";

exit;

}

$readres=uc(substr($read,$pos,1));

unless ($readres=~/[ACGT]/) {

print LOG "unexpected MD string $mismstrbk $read $readres\n";

exit;

is it possible it is when there is an N character? the reads can contain ACGTN characters

— Reply to this email directly or view it on GitHub.

mortenr commented 10 years ago

see example of samtools calmd stderr output below, it basically changes NM and MD fields.

[bam_fillmd1] different NM for read 'HWI-ST897:237:C2NK0ACXX:2:1203:11578:19032': 2 -> 3 [bam_fillmd1] different MD for read 'HWI-ST897:237:C2NK0ACXX:2:1203:11578:19032': '5G14^C45' -> '5G14^C0N44' [bam_fillmd1] different NM for read 'HWI-ST897:237:C2NK0ACXX:1:2208:14301:75036': 1 -> 2 [bam_fillmd1] different MD for read 'HWI-ST897:237:C2NK0ACXX:1:2208:14301:75036': '15^C67' -> '15^C0N66' [bam_fillmd1] different NM for read 'HWI-ST897:237:C2NK0ACXX:2:2203:9389:71136': 3 -> 4 [bam_fillmd1] different MD for read 'HWI-ST897:237:C2NK0ACXX:2:2203:9389:71136': '29C2G14^C17' -> '29C2G14^C0N16' [bam_fillmd1] different NM for read 'HWI-ST897:237:C2NK0ACXX:2:1208:10346:96870': 2 -> 3 [bam_fillmd1] different MD for read 'HWI-ST897:237:C2NK0ACXX:2:1208:10346:96870': '33^C20G4' -> '33^C0N19G4' [bam_fillmd1] different NM for read 'HWI-ST897:237:C2NK0ACXX:1:1206:19650:52915': 1 -> 2 [bam_fillmd1] different MD for read 'HWI-ST897:237:C2NK0ACXX:1:1206:19650:52915': '9^C46' -> '9^C0N45' [bam_fillmd1] different MD for read 'HWI-ST897:237:C2NK0ACXX:3:2105:5530:34435': '0A29G6T21' -> '0N29G6T21' [bam_fillmd1] different NM for read 'HWI-ST897:237:C2NK0ACXX:3:1303:11580:63473': 3 -> 4 [bam_fillmd1] different MD for read 'HWI-ST897:237:C2NK0ACXX:3:1303:11580:63473': '20C2^C41C0' -> '20C2^C0N40C0'

On Mar 18, 2014, at 23:30, sapfo notifications@github.com wrote:

ok. thanks. thats really odd then: all the bam files i tries have this warning output - so thats probably a small bug in previous quickstat. i just had not seen it.

could you help me out figure it whats expected? or should we ask Yong? seems to me in there:

while(substr($cigar_expanded, $pos, 1) ne "M") {

$pos++;

$refres=uc(substr($ref,$pos,1));

}

$refres=uc($curstr);

unless ($refres=~/[ACGT]/) {

print LOG "unexpected MD string $mismstrbk $curstr $refres\n";

exit;

}

$readres=uc(substr($read,$pos,1));

unless ($readres=~/[ACGT]/) {

print LOG "unexpected MD string $mismstrbk $read $readres\n";

exit;

is it possible it is when there is an N character? the reads can contain ACGTN characters

— Reply to this email directly or view it on GitHub.

sapfo commented 10 years ago

ok, thanks! Thats exactly it.

That sucks a little bit though as it seems to be standard procedure...?

Do you have an ideas around it? Maybe i should look up the calmd thing...

Morten do you have any input if we should leave in all of those as output:

  1. sequence quality distribution
  2. read length distr
  3. damage distribution
  4. damage pattern from 5 end
  5. damage pattern from 3 end

I never look at 1 and 3...but maybe i should? i cant quite remember what 3 was supposed to be...

On Wed, Mar 19, 2014 at 7:35 AM, Morten Rasmussen notifications@github.comwrote:

has these bam files been through GATK indelRealign? (and later samtools calmd) (i.e. Simon's default mapping pipeline)??

This most likely explains why they don't match the pattern expected, as indelrealign + samtools calmd adds N's and potentially other characters to the cigar.

Morten

On Mar 18, 2014, at 23:30, sapfo notifications@github.com wrote:

ok. thanks. thats really odd then: all the bam files i tries have this warning output - so thats probably a small bug in previous quickstat. i just had not seen it.

could you help me out figure it whats expected? or should we ask Yong? seems to me in there:

while(substr($cigar_expanded, $pos, 1) ne "M") {

$pos++;

$refres=uc(substr($ref,$pos,1));

}

$refres=uc($curstr);

unless ($refres=~/[ACGT]/) {

print LOG "unexpected MD string $mismstrbk $curstr $refres\n";

exit;

}

$readres=uc(substr($read,$pos,1));

unless ($readres=~/[ACGT]/) {

print LOG "unexpected MD string $mismstrbk $read $readres\n";

exit;

is it possible it is when there is an N character? the reads can contain ACGTN characters

Reply to this email directly or view it on GitHub.

Reply to this email directly or view it on GitHubhttps://github.com/ole-tange/barmaids/issues/65#issuecomment-38021720 .

mortenr commented 10 years ago

yeah, not sure how to fix that, the cigar string get’s a fair bit more complicated, so I think just adding ’N’ to the character class would cause more issues than it fixes.

I also never look at 1 and 3. I’ve never found a use for 3, I guess 1 can potentially be useful if it was ever to show an unexpected distribution (it seems unlikely, as reads with low quality are less likely to map). I think plot1 makes more sense to evaluate before mapping, the plot is basically replicated in the standard fastqc output. So in summary I’d agree to remove 1 and 3.

Morten

On Mar 18, 2014, at 23:49, sapfo notifications@github.com wrote:

ok, thanks! Thats exactly it.

That sucks a little bit though as it seems to be standard procedure...?

Do you have an ideas around it? Maybe i should look up the calmd thing...

Morten do you have any input if we should leave in all of those as output:

  1. sequence quality distribution
  2. read length distr
  3. damage distribution
  4. damage pattern from 5 end
  5. damage pattern from 3 end

I never look at 1 and 3...but maybe i should? i cant quite remember what 3 was supposed to be...

On Wed, Mar 19, 2014 at 7:35 AM, Morten Rasmussen notifications@github.comwrote:

has these bam files been through GATK indelRealign? (and later samtools calmd) (i.e. Simon's default mapping pipeline)??

This most likely explains why they don't match the pattern expected, as indelrealign + samtools calmd adds N's and potentially other characters to the cigar.

Morten

On Mar 18, 2014, at 23:30, sapfo notifications@github.com wrote:

ok. thanks. thats really odd then: all the bam files i tries have this warning output - so thats probably a small bug in previous quickstat. i just had not seen it.

could you help me out figure it whats expected? or should we ask Yong? seems to me in there:

while(substr($cigar_expanded, $pos, 1) ne "M") {

$pos++;

$refres=uc(substr($ref,$pos,1));

}

$refres=uc($curstr);

unless ($refres=~/[ACGT]/) {

print LOG "unexpected MD string $mismstrbk $curstr $refres\n";

exit;

}

$readres=uc(substr($read,$pos,1));

unless ($readres=~/[ACGT]/) {

print LOG "unexpected MD string $mismstrbk $read $readres\n";

exit;

is it possible it is when there is an N character? the reads can contain ACGTN characters

Reply to this email directly or view it on GitHub.

Reply to this email directly or view it on GitHubhttps://github.com/ole-tange/barmaids/issues/65#issuecomment-38021720 .

— Reply to this email directly or view it on GitHub.

mortenr commented 10 years ago

I’d also consider plotting 4+5 next to each other, as they are often useful to compare. This also reduces output to two pages. Morten

On Mar 18, 2014, at 23:49, sapfo notifications@github.com wrote:

ok, thanks! Thats exactly it.

That sucks a little bit though as it seems to be standard procedure...?

Do you have an ideas around it? Maybe i should look up the calmd thing...

Morten do you have any input if we should leave in all of those as output:

  1. sequence quality distribution
  2. read length distr
  3. damage distribution
  4. damage pattern from 5 end
  5. damage pattern from 3 end

I never look at 1 and 3...but maybe i should? i cant quite remember what 3 was supposed to be...

On Wed, Mar 19, 2014 at 7:35 AM, Morten Rasmussen notifications@github.comwrote:

has these bam files been through GATK indelRealign? (and later samtools calmd) (i.e. Simon's default mapping pipeline)??

This most likely explains why they don't match the pattern expected, as indelrealign + samtools calmd adds N's and potentially other characters to the cigar.

Morten

On Mar 18, 2014, at 23:30, sapfo notifications@github.com wrote:

ok. thanks. thats really odd then: all the bam files i tries have this warning output - so thats probably a small bug in previous quickstat. i just had not seen it.

could you help me out figure it whats expected? or should we ask Yong? seems to me in there:

while(substr($cigar_expanded, $pos, 1) ne "M") {

$pos++;

$refres=uc(substr($ref,$pos,1));

}

$refres=uc($curstr);

unless ($refres=~/[ACGT]/) {

print LOG "unexpected MD string $mismstrbk $curstr $refres\n";

exit;

}

$readres=uc(substr($read,$pos,1));

unless ($readres=~/[ACGT]/) {

print LOG "unexpected MD string $mismstrbk $read $readres\n";

exit;

is it possible it is when there is an N character? the reads can contain ACGTN characters

Reply to this email directly or view it on GitHub.

Reply to this email directly or view it on GitHubhttps://github.com/ole-tange/barmaids/issues/65#issuecomment-38021720 .

— Reply to this email directly or view it on GitHub.

sapfo commented 10 years ago

Ok then here is the final suggestion: A) update warning to "The bam file contains unexpected MD strings. Use -v to see them. If you have used GATK to realign your reads followed by samtools calmd, this is to be expected.\n" B) Remove page 1 and 3 from the output C) Merge the last two pages into one. I.e. at the end, two pages of output.

Ole: I looked into modifying src/bamdamage for A) but it tells it is not version controlled. Is that possible or i am doing something wrong?

ole-tange commented 10 years ago

src/bamdamage is generated by src/bamdamage.pl - so do the change in src/bamdamage.pl.

sapfo commented 10 years ago

Ok, all done. except installing it on eos....which i forgot how i should do....

sapfo commented 10 years ago

ok. so actually this warning is also outputted when not using GATK+calmd.

The cigar string can contain Ns regardless or?

Every read that has an N looks like an unexpected MD string.

so i changed: unless ($refres=~/[ACGT]/) to unless ($refres=~/[ACGTN]/)

I think has implication for the computation perhaps...i.e. the rates could be affected, potential bug.

i really dont feel like reading more perl today.