Closed Irallia closed 2 years ago
Merging #200 (9811e40) into master (e44bde5) will increase coverage by
0.00%
. The diff coverage is100.00%
.
@@ Coverage Diff @@
## master #200 +/- ##
=======================================
Coverage 98.34% 98.35%
=======================================
Files 18 18
Lines 846 850 +4
=======================================
+ Hits 832 836 +4
Misses 14 14
Impacted Files | Coverage Δ | |
---|---|---|
src/iGenVar.cpp | 100.00% <100.00%> (ø) |
|
...sv_detection_methods/analyze_split_read_method.cpp | 96.80% <100.00%> (ø) |
|
src/variant_detection/variant_detection.cpp | 96.63% <100.00%> (+0.05%) |
:arrow_up: |
Continue to review full report at Codecov.
Legend - Click here to learn more
Δ = absolute <relative> (impact)
,ø = not affected
,? = missing data
Powered by Codecov. Last update e44bde5...9811e40. Read the comment docs.
@joergi-w I think you need to do the same change in the SNP calling part.
Nice work! I have one major concern to think about: With SA tags like
1,101,+,6M94S,60,0;2,101,+,6S10M84S,60,0;1,107,+,16S10M74S,60,0;
and also the new vcf file, I find it much harder to read than before, because thechr
helped at least to spot the sequence name. This leads me to the question: how about adding thechr
prefix where it is missing rather than removing it? A regex_match could spot a single number in the reference name. What are possible disadvantages? Maybe we can discuss...
Yes, that is really a difficult question. Adding the "chr" shouldn't be any more difficult than removing it, and yes it would definitely be more readable. The only reason I decided to remove it was because the original truth set I use is without the "chr". But I have already created an edited truth set with "chr", so it wouldn't be such a big problem...
Will review in a bit! But reading through made me think of something, what's the check we have to make sure the BAM files are all from the same reference genome? So that we don't have a long read file aligned to hg19, and a short read file aligned to GRCh38 or something, which could cause some unexpected behavior and incorrect results?
Will review in a bit! But reading through made me think of something, what's the check we have to make sure the BAM files are all from the same reference genome? So that we don't have a long read file aligned to hg19, and a short read file aligned to GRCh38 or something, which could cause some unexpected behavior and incorrect results?
Oh that's a good point, I've already encountered that exact problem. As far as I can see, some of my example BAM files (all HG002) are aligned to different references:
MtSinai_PacBio: GCA_000001405.15_GRCh38_no_alt_analysis_set.fa
PacBio_CCS_10kb: hs37d5.fa
10X_Genomics: hg19.reordered.fa
Illumina: GCA_000001405.15_GRCh38_no_alt_plus_hs38d1_analysis_set.fa
Illumina_Mate_Pair: GCA_000001405.15_GRCh38_no_alt_plus_hs38d1_analysis_set.fa
Since we do not use FASTA files so far, we do not check anything. But with vaquita LR I encountered the problem because I can specify only one reference and sometimes just 2 exist, when I have 2 input files.
But you're right, maybe we should exclude this case completely anyway. Do you have an Idea what can go wrong? What exactly happens if I have an "old" aligned BAM file to hg18 and a newer one to hg19? And should we warn the user or forbid these things?
If we take 2 different references, can it happen that the position of a certain SV is different? If this is the case, we should really prohibit different ones.
If we take 2 different references, can it happen that the position of a certain SV is different? If this is the case, we should really prohibit different ones.
Exactly! For example, the sequence "ACAGTCATCGT" might map to chr1:100 in one reference, but chr1:150 in another reference.
Overall, the pull request looks fine. However, I'm not sure if we want to merge this, because of the discussion above about checking to make sure the reference genomes are the same anyways, and also because this would not work for other species, i.e. mouse genomes.
So I think just enforcing that all input BAMs are aligned to the same reference is the most robust solution.
That is a very good and important objection. I have now studied references in great detail and what problems can occur. In general, you probably want to have the same reference for everything, but that's hard for us to check. Also, I think there are situations where that doesn't matter. For example, there are various fasta files for GRCh38 / hg38 (e.g. some with and some without mitochondria or other). These should be compatible with each other, because they all describe the same version. Therefore I would leave the PR as it is, but add a notification that reminds the user to consider this (if he enters two input files).
Chromosome names can differ between "1" and "chr1" in different input files. IGenVar can read more than one input file, thus we would distinguish same SVs from different input files if the chromosome naming is different. To avoid having to change the input file itself, we decided to make the change within iGenVar. Thus we remove the "chr" prefix.