seqan / iGenVar

The official repository for the iGenVar project.
BSD 3-Clause "New" or "Revised" License
9 stars 8 forks source link

[FIX] Add the "chr" prefix of the ref name. #200

Closed Irallia closed 2 years ago

Irallia commented 2 years ago

Chromosome names can differ between "1" and "chr1" in different input files. IGenVar can read more than one input file, thus we would distinguish same SVs from different input files if the chromosome naming is different. To avoid having to change the input file itself, we decided to make the change within iGenVar. Thus we remove the "chr" prefix.

codecov[bot] commented 2 years ago

Codecov Report

Merging #200 (9811e40) into master (e44bde5) will increase coverage by 0.00%. The diff coverage is 100.00%.

@@           Coverage Diff           @@
##           master     #200   +/-   ##
=======================================
  Coverage   98.34%   98.35%           
=======================================
  Files          18       18           
  Lines         846      850    +4     
=======================================
+ Hits          832      836    +4     
  Misses         14       14           
Impacted Files Coverage Δ
src/iGenVar.cpp 100.00% <100.00%> (ø)
...sv_detection_methods/analyze_split_read_method.cpp 96.80% <100.00%> (ø)
src/variant_detection/variant_detection.cpp 96.63% <100.00%> (+0.05%) :arrow_up:

Continue to review full report at Codecov.

Legend - Click here to learn more Δ = absolute <relative> (impact), ø = not affected, ? = missing data Powered by Codecov. Last update e44bde5...9811e40. Read the comment docs.

Irallia commented 2 years ago

@joergi-w I think you need to do the same change in the SNP calling part.

Irallia commented 2 years ago

Nice work! I have one major concern to think about: With SA tags like 1,101,+,6M94S,60,0;2,101,+,6S10M84S,60,0;1,107,+,16S10M74S,60,0; and also the new vcf file, I find it much harder to read than before, because the chr helped at least to spot the sequence name. This leads me to the question: how about adding the chr prefix where it is missing rather than removing it? A regex_match could spot a single number in the reference name. What are possible disadvantages? Maybe we can discuss...

Yes, that is really a difficult question. Adding the "chr" shouldn't be any more difficult than removing it, and yes it would definitely be more readable. The only reason I decided to remove it was because the original truth set I use is without the "chr". But I have already created an edited truth set with "chr", so it wouldn't be such a big problem...

joshuak94 commented 2 years ago

Will review in a bit! But reading through made me think of something, what's the check we have to make sure the BAM files are all from the same reference genome? So that we don't have a long read file aligned to hg19, and a short read file aligned to GRCh38 or something, which could cause some unexpected behavior and incorrect results?

Irallia commented 2 years ago

Will review in a bit! But reading through made me think of something, what's the check we have to make sure the BAM files are all from the same reference genome? So that we don't have a long read file aligned to hg19, and a short read file aligned to GRCh38 or something, which could cause some unexpected behavior and incorrect results?

Oh that's a good point, I've already encountered that exact problem. As far as I can see, some of my example BAM files (all HG002) are aligned to different references:

    MtSinai_PacBio:     GCA_000001405.15_GRCh38_no_alt_analysis_set.fa
    PacBio_CCS_10kb:    hs37d5.fa
    10X_Genomics:       hg19.reordered.fa
    Illumina:           GCA_000001405.15_GRCh38_no_alt_plus_hs38d1_analysis_set.fa
    Illumina_Mate_Pair: GCA_000001405.15_GRCh38_no_alt_plus_hs38d1_analysis_set.fa

Since we do not use FASTA files so far, we do not check anything. But with vaquita LR I encountered the problem because I can specify only one reference and sometimes just 2 exist, when I have 2 input files.

But you're right, maybe we should exclude this case completely anyway. Do you have an Idea what can go wrong? What exactly happens if I have an "old" aligned BAM file to hg18 and a newer one to hg19? And should we warn the user or forbid these things?

Irallia commented 2 years ago

If we take 2 different references, can it happen that the position of a certain SV is different? If this is the case, we should really prohibit different ones.

joshuak94 commented 2 years ago

If we take 2 different references, can it happen that the position of a certain SV is different? If this is the case, we should really prohibit different ones.

Exactly! For example, the sequence "ACAGTCATCGT" might map to chr1:100 in one reference, but chr1:150 in another reference.

Irallia commented 2 years ago

Overall, the pull request looks fine. However, I'm not sure if we want to merge this, because of the discussion above about checking to make sure the reference genomes are the same anyways, and also because this would not work for other species, i.e. mouse genomes.

So I think just enforcing that all input BAMs are aligned to the same reference is the most robust solution.

That is a very good and important objection. I have now studied references in great detail and what problems can occur. In general, you probably want to have the same reference for everything, but that's hard for us to check. Also, I think there are situations where that doesn't matter. For example, there are various fasta files for GRCh38 / hg38 (e.g. some with and some without mitochondria or other). These should be compatible with each other, because they all describe the same version. Therefore I would leave the PR as it is, but add a notification that reminds the user to consider this (if he enters two input files).