stschiff / sequenceTools

Other
39 stars 10 forks source link

pileupCaller: ParsingError #3

Closed SRgen closed 5 years ago

SRgen commented 5 years ago

Hi

I'm using pileupCaller (sequenceTools version 1.2.2) in order to get .geno from .bam. Since I have several sequences I'm looping them in this way:

for i in *bam; do samtools mpileup -B -q30 -Q30 -f ref.fa $i | pileupCaller --sampleNames $i -f snplist -o EigenStrat -e $i; done

It was working fine until it started giving me this ParsingError:
pileupCaller: ParsingError {peContexts = [], peMessage = "Failed reading: satisfy Error occurred while trying to parse this chunk: \"hs37d5\t31498775\tR\t1\tA\tY\nhs37d5\t31498776\tG\t1\t.\t]\nhs37d5\t31498777\tA\t1\t.\t]\nhs37d5\t31498778\tC\t1\t.\t]\nhs37d5\t31498779\tA\t1\t.\tY\nhs37d5\t31498780\tG\t1\t.\t]\nhs37d5\t31498781\tT\t1\t.\t]\nhs37d5\t31498782\tG\t1\t.\t]\nhs37d5\t31498783\tC\t1\t.\t]\nhs37d5\t\""}

I already double check and I'm using the same reference described in the paper and presented in the bam header.

Am I doing something wrong? Any suggestions would be very helpful.

Thanks

stschiff commented 5 years ago

The error occurs because you have an “R” in the reference (at position hs37d5:31498775). What chromosome is this, and why is it called “hs37d5”, which is the name of the genome?

Here is the chunk that it fails to parse formatted:

hs37d5 31498775 R 1 A Y hs37d5 31498776 G 1 . ] hs37d5 31498777 A 1 . ] hs37d5 31498778 C 1 . ] hs37d5 31498779 A 1 . Y hs37d5 31498780 G 1 . ] hs37d5 31498781 T 1 . ] hs37d5 31498782 G 1 . ] hs37d5 31498783 C 1 . ] hs37d5

As you can see, there is an R in the first position listed. It expects A,C,T,G or N.

I am happy to change the parser to include R if I get convinced that an R as reference allele is not a bug. I have never seen an R, so I think it’s correct that if failed to parse this.

Stephan

On 4 Feb 2019, at 13:17, SRgen notifications@github.com wrote:

Hi

I'm using pileupCaller (sequenceTools version 1.2.2) in order to get .geno from .bam. Since I have several sequences I'm looping them in this way:

for i in *bam; do samtools mpileup -B -q30 -Q30 -f ref.fa $i | pileupCaller --sampleNames $i -f snplist -o EigenStrat -e $i; done It was working fine until it started giving me this ParsingError: pileupCaller: ParsingError {peContexts = [], peMessage = "Failed reading: satisfy Error occurred while trying to parse this chunk: "hs37d5\t31498775\tR\t1\tA\tY\nhs37d5\t31498776\tG\t1\t.\t]\nhs37d5\t31498777\tA\t1\t.\t]\nhs37d5\t31498778\tC\t1\t.\t]\nhs37d5\t31498779\tA\t1\t.\tY\nhs37d5\t31498780\tG\t1\t.\t]\nhs37d5\t31498781\tT\t1\t.\t]\nhs37d5\t31498782\tG\t1\t.\t]\nhs37d5\t31498783\tC\t1\t.\t]\nhs37d5\t""}

I already double check and I'm using the same reference described in the paper and presented in the bam header.

Am I doing something wrong? Any suggestions would be very helpful.

Thanks

— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub https://github.com/stschiff/sequenceTools/issues/3, or mute the thread https://github.com/notifications/unsubscribe-auth/AAbQmuRa7fkAmbZFFnfswZP9yPs2Z8qtks5vKCTTgaJpZM4ahFbH.

SRgen commented 5 years ago

Hi Stephan

Thanks for the reply.

hs37d5 is GRCh37 + rCRS + Human herpesvirus 4 type 1 + concatenated decoy sequences (hs37d5). I checked and it has 2 Rs in chr3 and a bunch of non ACGTN IUPAC code nucleotides in it.

Are you suggesting that if I change this nucleotides to Ns the parser should be fine? Another question: Even if the parser giver the error is the .geno until that point usable?

Thanks

Simao

stschiff commented 5 years ago

Hi Simao,

I know what hs37d5 is, of course, but I don’t understand why it appears in the first column of the pileup. It should be the chromosome name in that column, I think.

The second thing I don’t understand is why you seem to stream every site. The kind of “calling” that pileupCaller provides is designed to work on a defined set of positions, like the positions in some defined SNP Chip. So it would be much more efficient to pass a “-l” option to samtools mpileup to restrict the output to a defined set of positions. That way you would probably also avoid these weird R positions.

To answer your question though: Yes, if you replace the R’s by N’s, then it should work. However, given your explanation that there are actually R’s in the reference genome, I’m considering allowing additional letters in the parser. I will investigate and keep you posted.

Best, Stephan

On 4 Feb 2019, at 18:00, SRgen notifications@github.com wrote:

Hi Stephan

Thanks for the reply.

hs37d5 is GRCh37 + rCRS + Human herpesvirus 4 type 1 + concatenated decoy sequences (hs37d5). I checked and it has 2 Rs in chr3 and a bunch of non ACGTN IUPAC code nucleotides in it.

Are you suggesting that if I change this nucleotides to Ns the parser should be fine? Another question: Even if the parser giver the error is the .geno until that point usable?

Thanks

Simao

— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/stschiff/sequenceTools/issues/3#issuecomment-460326332, or mute the thread https://github.com/notifications/unsubscribe-auth/AAbQmmZQ5q6gANJwOCkF7erDskHlEXXIks5vKGchgaJpZM4ahFbH.

SRgen commented 5 years ago

Thank you Stephan

I can now consider this issue resolved. I'm doing both -l to filter some unwanted regions and changing to Ns since a reference shouldn't have them. If you want to change the parser it would be great but be aware that this reference, (https://googlegenomics.readthedocs.io/en/latest/use_cases/discover_public_data/reference_genomes.html) also has Y, S, W,K and M.

Once again thanks for your help

Simao

Arkkienkeli commented 5 years ago

Hello, I also got this error, but I have proper chromosome names:

"3\t60830763\tR\t2\tAA\tJJ\n3\t60830764\tR\t2\tAA\tJJ\n3\t60830765\tG\t2\t..\tJJ\n3\t60830766\tC\t2\t..\tJJ\n3\t60830767\tT\t2\t..\tJJ\n3\t60830768\tT\t2\t..\tJJ\n3\t60830769\tG\t2\t..\tJJ\n3\t60830770\tG\t2\t..\tJJ\n3\t60830771\tT\t2\t..\tJJ\n3\t60830772\tT\t2\t..\tJJ\n3\t60830773\tC\t2\t..\tJJ\n3\t60830774\tT\t2\t..\tJJ\n3\t60830775\tA\t2\t..\tJJ\n3\t60830776\tA\t2\t..\tJJ\n3\t60830777\tC\t2\t..\tJJ\n3\t60830778\tA\t2\t..\tJJ\n3\t60830779\tA\t2\t..\tJJ\n3\t60830780\tT\t2\t..\tJJ\n3\t60830781\tG\t2\t..\tJJ\n3\t60830782\tA\t2\t..\tJJ\n3\t60830783\tA\t2\t..\tJJ\n3\t60830784\tT\t2\t..\tJJ\n3\t60830785\tT\t2\t..\tJJ\n3\t60830786\tT\t2\t..\tJJ\n3\t60830787\tA\t2\t..\tJJ\n3\t60830788\tA\t2\t..\tJJ\n3\t60830789\tT\t2\t..\tJJ\n3\t60830790\tA\t2\t..\tJJ\n3\t60830791\tA\t2\t..\tJJ\n3\t60830792\tG\t2\t..\tJJ\n3\t60830793\tA\t2\t..\tJJ\n3\t60830794\tA\t2\t..\tJJ\n3\t60830795\tT\t2\t..\tJJ\n3\t60830796\tT\t2\t..\tJJ\n3\t60830797\tG\t2\t..\tJJ\n3\t60830798\tT\t2\t..\tJJ\n3\t60830799\tA\t2\t..\tJJ\n3\t60830800\tT\t2\t.$.$\tJJ\n3\t60831011\tA\t1\t^\""

stschiff commented 5 years ago

Hi Nick,

this isn’t about chromosome names, but about the “R” appearing as the reference allele in column three. I only allow A,C,T,G,N in place of the reference allele. As a quick fix, you can convert those R’s to N’s, with something like

samtools mileup … | awk ‘$3==“R" {$3=“N"; print} {print}’ | pileupCaller ...

I haven’t yet fixed this in other terms, but have put a note on my (much too long) To-Do list.

Best, Stephan

On 15 Apr 2019, at 22:17, Nick notifications@github.com wrote:

Hello, I also got this error, but I have proper chromosome names:

"3\t60830763\tR\t2\tAA\tJJ\n3\t60830764\tR\t2\tAA\tJJ\n3\t60830765\tG\t2\t..\tJJ\n3\t60830766\tC\t2\t..\tJJ\n3\t60830767\tT\t2\t..\tJJ\n3\t60830768\tT\t2\t..\tJJ\n3\t60830769\tG\t2\t..\tJJ\n3\t60830770\tG\t2\t..\tJJ\n3\t60830771\tT\t2\t..\tJJ\n3\t60830772\tT\t2\t..\tJJ\n3\t60830773\tC\t2\t..\tJJ\n3\t60830774\tT\t2\t..\tJJ\n3\t60830775\tA\t2\t..\tJJ\n3\t60830776\tA\t2\t..\tJJ\n3\t60830777\tC\t2\t..\tJJ\n3\t60830778\tA\t2\t..\tJJ\n3\t60830779\tA\t2\t..\tJJ\n3\t60830780\tT\t2\t..\tJJ\n3\t60830781\tG\t2\t..\tJJ\n3\t60830782\tA\t2\t..\tJJ\n3\t60830783\tA\t2\t..\tJJ\n3\t60830784\tT\t2\t..\tJJ\n3\t60830785\tT\t2\t..\tJJ\n3\t60830786\tT\t2\t..\tJJ\n3\t60830787\tA\t2\t..\tJJ\n3\t60830788\tA\t2\t..\tJJ\n3\t60830789\tT\t2\t..\tJJ\n3\t60830790\tA\t2\t..\tJJ\n3\t60830791\tA\t2\t..\tJJ\n3\t60830792\tG\t2\t..\tJJ\n3\t60830793\tA\t2\t..\tJJ\n3\t60830794\tA\t2\t..\tJJ\n3\t60830795\tT\t2\t..\tJJ\n3\t60830796\tT\t2\t..\tJJ\n3\t60830797\tG\t2\t..\tJJ\n3\t60830798\tT\t2\t..\tJJ\n3\t60830799\tA\t2\t..\tJJ\n3\t60830800\tT\t2\t.$.$\tJJ\n3\t60831011\tA\t1\t^""

— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/stschiff/sequenceTools/issues/3#issuecomment-483402414, or mute the thread https://github.com/notifications/unsubscribe-auth/AAbQmq37KPORAcyykjNM_rMRSF5ls9C8ks5vhN5kgaJpZM4ahFbH.

Arkkienkeli commented 5 years ago

Thank you Stephan, it did work.