stschiff / sequenceTools


pileupCaller: WrongInputOrderException "ordering violated: #30

Closed nullquine closed 1 year ago

nullquine commented 1 year ago

Hi!

I am trying to figure out the exact nature of this issue and would like to ask for some help.

When I try to generate EIGENSTRAT output from a collection of BAM files, I repeatedly get the following error message:

pileupCaller: WrongInputOrderException "ordering violated: PileupRow {pileupChrom = 19, pileupPos = 62678347, pileupRef = 'G', pileupBases = [\"GG\",\"GGG\",\"G\",\"\",\"\"], pileupStrandInfo = [[ReverseStrand,ForwardStrand],[ReverseStrand,ReverseStrand,ReverseStrand],[ReverseStrand],[],[]]} should come after PileupRow {pileupChrom = 1, pileupPos = 233, pileupRef = 'G', pileupBases = [\"GGGGGGG\",\"GG\",\"G\",\"\",\"\"], pileupStrandInfo = [[ForwardStrand,ForwardStrand,ForwardStrand,ForwardStrand,ForwardStrand,ForwardStrand,ForwardStrand],[ForwardStrand,ForwardStrand],[ForwardStrand],[],[]]}"

(This is the output of a smaller test run on 5 BAMs; the issue is the same on larger batches.)

To the best of my understanding, and based on what I found in other issues, the cause is that the chromosomes are in lexical rather than karyotypic order. I reordered the reference FASTA and the positions list file used in the reference pipeline, but the problem remains. The SNP files are lexicographically ordered.

I am using horse genomic data, in case it is relevant (with the appropriate reference and positions list, of course).

Any ideas how to resolve this?

Thank you in advance and please let me know if you need additional information.

Best: Kornel

stschiff commented 1 year ago

Hi Kornel.

The error definitely comes from the fact that in your pileup input, a position on chromosome 1 comes after a position on chromosome 19. The software needs to assume an ordering so that it can "weave" together pileup and SNP positions. The assumption is that chromosome names are either numerical (1, 2, 3, ...) or numerical with a chr prefix (chr1, chr2, chr3, ...). The only non-numeric names I deal with are X (which is automatically translated to 23), Y (which is translated to 24) and MT (which is translated to 90).
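For readers who want to check their own input before running pileupCaller, here is a minimal Python sketch of the convention described above. It is not pileupCaller's actual code: the function names `chrom_to_int` and `first_order_violation` are made up for illustration, and the behaviour for names outside the listed cases (raising an error) is an assumption.

```python
def chrom_to_int(name: str) -> int:
    """Translate a chromosome name to an integer, following the convention
    described in the comment above (hypothetical helper, not pileupCaller code)."""
    # Strip an optional "chr" prefix, e.g. "chr19" -> "19".
    core = name[3:] if name.startswith("chr") else name
    special = {"X": 23, "Y": 24, "MT": 90}
    if core in special:
        return special[core]
    # Assumption: any other non-numeric name is rejected (int() raises ValueError).
    return int(core)


def first_order_violation(rows):
    """Return the first (previous, current) pair of (chrom, pos) rows that is
    out of order, or None if the whole input is sorted."""
    prev_key, prev_row = None, None
    for chrom, pos in rows:
        key = (chrom_to_int(chrom), pos)
        if prev_key is not None and key < prev_key:
            return prev_row, (chrom, pos)
        prev_key, prev_row = key, (chrom, pos)
    return None


# The two rows from the error message, in the order they appeared in the input.
rows = [("19", 62678347), ("1", 233)]
print(first_order_violation(rows))
# → (('19', 62678347), ('1', 233))
```

Input sorted lexically by chromosome name (1, 10, 11, ..., 19, 2, ...) will trip a check like this as soon as chromosome 2 follows chromosome 19.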

I fully recognise that these conventions are geared towards human data. If you could tell me what the situation is with horses, I might be able to come up with something that caters for that case.

nullquine commented 1 year ago

We figured out a technical solution: reorder the chromosomes in the source BAM files to karyotypic order (so 1,2,3..9,10,11..19,20..29,30..XYM). It is computationally intensive (which is why I wanted to avoid it if possible), but it works so far. The chromosome name conversion can easily be done in the reference pipeline via sed, so that part is not really an issue.
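As a rough illustration of the target order, the following Python sketch sorts a lexically ordered list of chromosome names into the karyotypic order described above. The helper names `karyotypic_key` and `karyotypic_order` are hypothetical, and placing any unrecognised non-numeric contig after X, Y and M is an assumption. The sketch only computes the order of names; reordering the BAM files themselves still has to be done with a separate tool.

```python
def karyotypic_key(name: str):
    """Sort key: numeric chromosomes first in numeric order, then X, Y, M.
    (Hypothetical helper for illustration.)"""
    # Strip an optional "chr" prefix, e.g. "chr10" -> "10".
    core = name[3:] if name.startswith("chr") else name
    if core.isdigit():
        return (0, int(core), "")
    # Assumption: unknown non-numeric contigs go last, sorted by name.
    tail = {"X": 0, "Y": 1, "M": 2, "MT": 2}
    return (1, tail.get(core, 3), core)


def karyotypic_order(names):
    """Return the chromosome names sorted into karyotypic order."""
    return sorted(names, key=karyotypic_key)


lexical = ["1", "10", "11", "19", "2", "20", "3", "30", "9", "M", "X", "Y"]
print(karyotypic_order(lexical))
# → ['1', '2', '3', '9', '10', '11', '19', '20', '30', 'X', 'Y', 'M']
```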

Thank you for the assistance and your time, I am confident I will be able to continue using the pipeline with this knowledge.

Best regards, Kornel

stschiff commented 1 year ago

Great. OK, I'll close this for now. Otherwise, feel free to reopen it to continue the discussion.