stschiff / sequenceTools

Other
39 stars 10 forks source link

pileupCaller: SeqFormatException "Error while parsing #8

Closed Anahit19 closed 4 years ago

Anahit19 commented 4 years ago

Hi,

I am using pileupcaller to sample alleles from my low-coverage bam files. First, I created pileup.txt file with samtools mpileup and then EIGENSTRAT.snp file. However, when I am running the pileupcaller with my command line, I have the following issue:

-bash-4.2$ pileupCaller --randomHaploid --sampleNames NE --samplePopName NE -f EIGEN STRAT_trial.snp -e NE_hapl < pileup_trial.txt pileupCaller: SeqFormatException "Error while parsing: Failed reading: satisfyWith. Error occurred when try ing to parse this chunk: \"rs9992 1 0.0 9992\nrs9993 1 0.0 9993\nrs9994 1 0.0 9994\nrs9995 1 0.0 9995\n rs9996 1 0.0 9996\nrs9997 1 0.0 9997\nrs9998 1 0.0 9998\nrs9999 1 0.0 9999\nrs10000 1 0.0 10000\nrs100 01 1 0.0 10000 \t\""

Here is my pileup_trial.txt and EIGENSTAT.snp files:

1 9992 N 0 1 9993 N 0 1 9994 N 1 C C 1 9995 N 1 T F 1 9996 N 2 TT FC 1 9997 N 2 CC FF 1 9998 N 3 CCc FFB 1 9999 N 3 GGg FFD 1 10000 N 3 AAa HFE 1 10001 T 3 .+1C.,+1c HFC

rs9992 1 0.0 9992 rs9993 1 0.0 9993 rs9994 1 0.0 9994 rs9995 1 0.0 9995 rs9996 1 0.0 9996 rs9997 1 0.0 9997 rs9998 1 0.0 9998 rs9999 1 0.0 9999 rs10000 1 0.0 10000 rs10001 1 0.0 10000

Could you please help with this issue.

Thanks

Anahit

stschiff commented 4 years ago

Your Eigenstrat SNP file needs six columns, including the reference and alternate allele at each position. I've updated the readme to make that clear.

Anahit19 commented 4 years ago

Thank you!

I am working on bam.files and I can extract reference state from my pileup.txt file, but how to extract alternative state?

stschiff commented 4 years ago

Yes, I realise this is an issue. I would say this: 1) For most applications of pileupCaller, you're not attempting to do "de-novo-calling", so you're mostly trying to pull down on a defined set of SNPs. In most contexts, you have reference information for that SNP list. For example, a genotyping dataset with other individuals where you already know what the two alleles are you're looking for. In case of human data, even if you want to call "everywhere", you would probably choose the 1000 Genomes data to determine what the two alleles you want to call are. 2) I might re-implement an option to use pileupCaller to call "everywhere" and determine itself what the two alleles are. But I don't like that too much because it might be misused then by people not knowing what they can and cannot do with low-coverage sequencing data.

Anahit19 commented 4 years ago

Thank you for your help!

Anahit19 commented 4 years ago

Hi Stephan,

Two low-coverage bam files were successfully processed, but the third one keeps giving me an error on one of the chromosomes (attached).

What could be the issue?

Thanks again,

Anahit

output.txt

stschiff commented 4 years ago

Sorry for the terrible error format. It tells you that it got problems parsing this chunk of the (pileup) input:

3   60830763    R   1   A   I
3   60830764    R   1   A   J
3   60830765    G   1   .   J
3   60830766    C   1   .   J
3   60830767    T   1   .   J
3   60830768    T   1   .   J
3   60830769    G   1   .   I
3   60830770    G   1   .   J
3   60830771    T   1   .   J
3   60830772    T   1   .   I
3   60830773    C   1   .   J
3   60830774    T   1   .   J
3   60830775    A   1   .   F
3   60830776    A   1   .   I
3   60830777    C   1   .   J
3   60830778    A   1   .   J
3   60830779    A   1   .   J
3   60830780    T   1   .   I
3   60830781    G   1   .   J

As you can see, you have an "R" in place of the reference allele, which it doesn't accept. I could change that, but I'm not really comfortable with it, because it's actually quite unclear what an R in the reference genome means. I would suggest, as a quick fix, you change these to an N using a simple sed pipeline, such as

samtools pileup ... | awk '{if($3=="R")$3="N"; print}' | pileupCaller ...

or if you're reading the pileup data from a file:

awk '{if($3=="R")$3="N"; print}' pileup.txt | pileupCaller ...

I've not tested this awk script, please check before you apply it.

Anahit19 commented 4 years ago

Thanks a lot!

It is only surprising that I did not have this issue with the two other bams, though I was using the same reference. Probably these sites are not covered.

Thanks again

Anahit

stschiff commented 4 years ago

Yes, that is surprising. And I think the only explanation is that your other two individuals had no data in these positions, so they weren't output. Note that an alternative way is to restrict the pileup to the positions in your eigenstrat positions file in the first place (which hopefully doesn't contain R alleles in the reference). That way, pileupCaller doesn't even have to parse those. You can do that in samtools.

Anahit19 commented 4 years ago

Thank you, Stephan!