Generating Alternative Reference

grpiccoli commented 1 year ago

Hi,

I have a hemizygous genome and I have detected SVs on the main reference genome using dysgu. I wanted to confirm that the filtering parameters I am using would make sense in this case, and also to check if you knew of any good tools that can create an alternative reference genome using the vcf output file from dysgu. I ran dysgu as follows:

dysgu run -p$cpus --mode pacbio --pl pacbio --drop-gaps False --verbosity 2 --metrics --clean --overwrite $ref temp_${name} $file -o ${name}.vcf dysgu filter -p$cpus --min-prob 0.5 -o ${name}.flt.vcf ${name}.vcf

I am merging the results from dysgu with the results from deepvariant to obtain an full alt vcf. I was using gatk to generate the alternate reference but the issue is that it only can process SNPs INS and DELs but not DUP, TRA and INV gatk IndexFeatureFile -I 3.chr.vcf gatk CreateSequenceDictionary -R ${ref}.fasta -O ${ref}.dict gatk FastaAlternateReferenceMaker -R ${ref}.fasta -O reference_alt.fasta -V 3.chr.vcf

is there any good alternative for this that you could recommend?

Thanks

kcleal commented 1 year ago

Hi @grpiccoli, I think the filtering looks OK, you should remove most false-positives. Lowering the --min-prob will increase sensitivity though. Making a new fasta sequence is not something I have tried myself, so can't claim any expertise. I image creating a new reference from SNPs + indels + symbolic SVs (DEL, INV etc, that dont have exact sequences) would be quite a challenge. The SVs can lead to non-trivial genome structure. DNA assembly/scaffolding programs might be worth checking out?

grpiccoli commented 1 year ago

@kcleal thank you for your reply. What min-prob would you recommend? about the alternative assembly I am trying bcftools consensus I think it should work for more complex variant types, I'll let you know how it goes. I have an alternative assembly and I would like to compare it to the alternate assembly from the SV analysis. Thanks

kcleal commented 1 year ago

0.5 is usually good is you are using short reads, but for long-reads it can sometimes be lowered. Filtering is quite quick to perform, so I would check the output from 0.4, 0.45 and 0.5. Also, you may want to use --pass-prob 0, to re-label variants in your final set as PASS, otherwise they will retain the same lowProb / PASS labels as before. To save time, you can try our partner tool GW https://github.com/kcleal/gw to quickly visualize the results of filtering, e.g.:

dysgu filter --min-prob 0.4 dysgu.vcf | gw ref.fa -b your.bam -v -

grpiccoli commented 1 year ago

Thank you for the software solutions! gw looks really good. I will close the ticket for now and update it if I find a way to get the alternative reference, just in case someone else might need it.

wenyuhaokikika commented 1 year ago

Hello, I'm also working on the same task, but I'm using bcftools consensus. My mutation data comes from two sources: 1. GATK germline mutations pipeline (https://gatk.broadinstitute.org/hc/en-us/articles/360035535932-Germline-short-variant-discovery-SNPs-Indels/) mainly for SNPs and short indels. 2. Using dysgu to find structural variants.

I believe that a combination of these two types of mutations represents an individual's mutations.

Methods for generating ALT sequences I used

I downloaded WGS data from GEO, ran BWA to get sorted BAM files, and then used dysgu as follows:

sample = SRR10129631
samtools index  $sample"_sorted.bam"
dysgu run -x -p8 /data/wenyuhao/55/CCLE/raw/resources/genome.fasta tmp $sample"_sorted.bam" > $sample"_dysgu.vcf"
dysgu filter --min-prob 0.75 --support-fraction 0.2 SRR10129632_dysgu.vcf

The VCF files generated by dysgu contain ALT values like <DEL>,<DUP>, <INS>, <INV> ,<TRA>, which are not supported by bcftools consensus. Therefore, I want to extract all possible variants from the ALT field.

I referred to two issues issues 10, issues 63 and came up with the following strategy:

For DEL mutations, I delete the sequence of SVLEN length after the ref, keeping the ref unchanged and changing alt to genome[chrom][pos:pos+SVLEN].
For INS mutations, there are two cases. A. If the ALT field is not <INS>, then no changes are needed. B. If the ALT field is <INS>, I change ALT to 'LEFT_SVINSSEQ' + ref + 'RIGHT_SVINSSEQ'
For <DUP>, <INV>, <TRA> mutations, I'm not sure how to handle them yet.

I've written a script process_dysgu.py for this purpose. Could you please confirm if the above process is correct? If not, kindly point out the corrections needed. If it's correct, could you also provide guidance on how to handle the <DUP>, <INV>, and <TRA> mutations?

eg.

python process_dysgu.py -i SRR10129632_dysgu.vcf -r /data/wenyuhao/55/CCLE/raw/resources/genome.fasta | bgzip -c > test.vcf.gz && tabix -p vcf test.vcf.gz
bcftools consensus -f /data/wenyuhao/55/CCLE/raw/resources/genome.fasta  test.vcf.gz > test.fa

question About INFO['END'] and INFO['SVLEN'] for INS

Generally speaking，‘END‘ must be in INFO in SV vcf file, if some rows are missing END field, need to add. I counted the missing END field, usually INS mutation. bcftools consensus will check whether the insertion sequence and SVLEN match, and throw an exception if they do not match

The fasta sequence does not match the REF allele at 1:886224:
   REF .vcf: [T]
   ALT .vcf: [TGCCCTTTGGCAGAGCAGGTGTGCTGTGCTG]
   REF .fa : [TGCCCTTTGGCAGAGCAGGTGTGCTGTGCTG]TGCTGA

However, during the process, I found that the length of the insertion sequence ('LEFT_SVINSSEQ' + ref + 'RIGHT_SVINSSEQ') is not equal to SVLEN. It is either greater than or rarely equal to SVLEN, which has left me puzzled. I am unsure whether SVLEN is meaningful for INS mutations. As a result, I removed the END field, but I am uncertain if this is the correct approach.

If the length of the insertion sequence is greater than SVLEN, I can trim a portion of lowercase letters and a portion of uppercase letters. However, if the length of the insertion sequence is less than SVLEN, how should I handle it?

Supplementary Note 1，The insert sequence ('LEFT_SVINSSEQ' + ref + 'RIGHT_SVINSSEQ') is longer than SVLEN

{'#CHROM': '1',
 'POS': 1478003,
 'ID': 514531,
 'REF': 'C',
 'ALT': '<INS>',
 'FORMAT': 'GT:GQ:MAPQP:SU:WR:PE:SR:SC:BND:COV:NEIGH10:PS:MS:RMS:RED:BCC:FCC:ICN:OCN:PROB',
 'SRR10129632': '1/1:25:41:16:0:0:4:16:10:37.7:4:5:15:90:0:1:0.53:1.402:0.744:0.835',
 'SVTYPE': 'INS',
 'END': '1478003',
 'CT': '3to5',
 'SVLEN': '133',
 'CONTIGA': 'AATGTTGTATTTTAGTAGAGACGGGGTTTCTCCATGTTGGTCAGGCTGGTCTCTAACTCCCGACCTCAGGTGATCCACCCGCCTCGGCCTCTCAAACTGTTGGGATTACAGGCATGTGCCACCACGCCTGGCtaatgttgtattttagtagagacggggtttctccatgttggtcag',
 'CONTIGB': 'GTAGAGACGGGGTTTCTCCATGTTGGTCAGGCTGGTCTCTAACTCCCGACCTCAGGTGATCCACCCGCCTCGGCCTCTCAAACTGTTGGGATTACAGGCATGTGCCACCACGCCTGGCtaatgttgtattttagtagagacggggtttctc',
 'RIGHT_SVINSSEQ': 'taatgttgtattttagtagagacggggtttctccatgttggtcag',
 'LEFT_SVINSSEQ': nan}

2，INS but ALT is not <INS>

1   886225  302472  T   GCCCTTTGGCAGAGCAGGTGTGCTGTGCTG  .   PASS
    SVMETHOD=DYSGUv1.5.0;SVTYPE=INS;END=886226;CHR2=1;GRP=302472;NGRP=1;

CT=3to5;CIPOS95=12;CIEND95=12;SVLEN=30;KIND=extra-regional;GC=58.75;NEXP=0;STRIDE=0;

EXPSEQ;RPOLY=10;OL=0;SU=49;WR=7;PE=12;SR=6;SC=23;BND=0;LPREC=1;RT=pe    

GT:GQ:MAPQP:SU:WR:PE:SR:SC:BND:COV:NEIGH10:PS:MS:RMS:RED:BCC:FCC:ICN:OCN:PROB   

0/1:200:48.15:49:7:12:6:23:0:45.12:18:19:17:0:0:7:0.745:1.146:0.854:0.771

In fact, bcftools consensus does not entirely lack support for SV VCF files. It only supports three fields: <DEL>, <*> or <NON_REF>, and it cannot handle the <INS>,<DUP>, <INV>, and <TRA> fields. Therefore, in this case, I only dealt with the <INS> field.

└─(21:17:51)──> bcftools consensus -f /data/wenyuhao/55/CCLE/raw/resources/genome.fasta  wnm.vcf.gz > wnm.fa                             ──(Thu,Aug17)─┘
Symbolic alleles other than <DEL>, <*> or <NON_REF> are currently not supported, e.g. "<DUP>" at 1:231202004.
Please use filtering expressions to exclude such sites, for example by running with: -e 'ALT~"<.*>"'

kcleal commented 1 year ago

Hi @wenyuhaokikika, It looks like you have made some good progress on the problem. Ill try and address your points below:

-

dysgu run -x -p8 /data/wenyuhao/55/CCLE/raw/resources/genome.fasta tmp $sample"_sorted.bam" > $sample"_dysgu.vcf"
dysgu filter --min-prob 0.75 --support-fraction 0.2 SRR10129632_dysgu.vcf

The command looks good, you should attain high precision calls with these commands.

-

"For DEL mutations, I delete the sequence of SVLEN length after the ref, keeping the ref unchanged and changing alt to genome[chrom][pos:pos+SVLEN]."

I see no problem with this.

"_For INS mutations, there are two cases. A. If the ALT field is not INS, then no changes are needed. B. If the ALT field is INS, I change ALT to 'LEFT_SVINSSEQ' + ref + 'RIGHTSVINSSEQ'"

LEFT_SVINSSEQ and RIGHT_SVINSSEQ mean that the insertion sequence was not fully mapped during SV calling, and only a fragment of the insertion was mapped. Usually only one of these will be reported, but both might be reported if two SVs were merged during a later step in the pipeline. The SVLEN is also only estimated from the paired-end reads and will not generally be very accurate for these events. The safest bet for these events is to just use either LEFT_SVINSSEQ or RIGHT_SVINSSEQ, not both. It will at least capture some of the insertion sequence accurately. To obtain the full insertion sequence, you would need another assembly based approach.

-

"could you also provide guidance on how to handle the DUP, INV, and TRA mutations?"

SVs in these categories can be quite complex. For DUP and INV, if these events occur in isolation ( without any other SVs nearby) and they are quite small in scale, then you can assume they are simple events. You would need to tandem duplicate or invert the reference over these coordinates. For DUP/INV that are clustered with other events, then the resulting pattern can be more complex, e.g. duplication combined with deletion etc. These would require quite a bit of work to figure out. For TRA, these are often mobile element insertions, unless you are looking at a cancer genome where anything is possible.

Hope this answers some of your questions, but feel free to ask more!

grpiccoli commented 1 year ago

Hi @wenyuhaokikika it looks like we are "phasing" the same problem (pun intended). I am working on a very similar script in python that deals with DEL DUP INS INV BND|TRA. An alternative that did not work for me was to split the reads by haplotype using longphase and then assembling them separately. I mention it just in case that could work for you. Please let me know if we can cooperate in developing this python script Thanks

wenyuhaokikika commented 1 year ago

Hello, the author provides a script convert2bnd.py to split the reads by haplotype using longphase, but even if separated , I still don't know how to assemble them separately.

grpiccoli commented 1 year ago

Hi, that's my issue as well. I have the two phased bam files and I was able to assemble them into a very fragmented assembly using HiFiasm but then when I try to scaffold them it fails, I am still trying to figure out why, but it looks that because the coverage was halved somehow the RAM that it takes to scaffold the assembly grew exponentially and even at 1T of RAM it still runs out of memory. What type of reads are you using? maybe you won't face the same issue if you have already obtained the phased bam files.

grpiccoli commented 1 year ago

Another alternative is to use RagTag to scaffold your reads with the assembly you already have as a template, that might be what you are looking for

grpiccoli commented 1 year ago

I was able to solve my issue by reassembling in the end, with a custom algorithm based on the two approaches mentioned above, I don't think I'll pursue trying to phase the assembly based on the vcf any longer but you can find a very rough draft of what I was working on here https://github.com/grpiccoli/vcf2haplotypes/blob/master/phasedvcf2haplotypes.vcf in case it might be useful to anyone else

wenyuhaokikika commented 1 year ago

Hello, I have a few questions about your script:

Regarding handling DEL. During the first FOR loop, after processing the first DEL, all the positions after the mutation have changed. Won't this cause a shift in the sequence for the subsequent positions? How do you ensure that positions beyond the first DEL remain aligned?

I understand that allelic mutations indicate whether mutations occur on the positive or negative strand. If it's 0/0, it means there are no mutations on either strand. If it's 0/1, it means there's no mutation on the positive strand, but a mutation on the negative strand. And if it's 1/1, it indicates mutations on both the positive and negative strands.The following is from chatGPT


Q:The allelic mutation in vcf is 0/0, 1/1, 0/1. If it is 0/0, it means that there is no mutation on the positive and negative chains, and 0/1 means that the positive chain is not mutated, and the negative chain is mutated. , 1/1 means that both positive and negative are mutated, is that what it means?
A:Yes, your understanding is correct. In the VCF file, allelic mutations are represented using numeric genotypes. The numbers 0, 1, and 2 are commonly used to represent different genotypes. Specifically:

0/0: Both alleles on the positive and negative strands are the same as the reference allele, indicating no mutation.
1/1: Both alleles on the positive and negative strands are different from the reference allele, indicating a homozygous mutation.
0/1: The allele on the positive strand is the reference allele, while the allele on the negative strand is different, indicating a heterozygous mutation.

So, your interpretation of the genotypes is accurate.


If it's a positive and negative strand scenario, why are the same operations performed on both strands? Shouldn't the operations be performed in reverse complement on the negative strand?

From my perspective, multiple mutations occurring on a chromosome are independent events, but their effects are not independent. This issue is complex, and by relying on bcftools consensus, we can generate the corresponding VCF with complete ALT fields.

grpiccoli commented 1 year ago

Hi @wenyuhaokikika, yes it will you are right about your first point, and even if it did work there is no clear way to deal with unphased variants, so you will most likely end up with highly chimeric haplotypes either way. / <-- this indicates that it does not know which strand it occurs in | <-- this indicates it does 0 <-- this means it is the "reference" haplotype (i.e. if it is / it means that it is whatever you used as a reference, in the case of '|' all '1s' will be in the same haplotig) That's why I was trying to reassemble so hard as it is the recommended option in this case. In my case I need to get the phased haplotypes in separate fasta files so only associated variants are grouped on the same file. If your case is different and you don't mind somewhat chimeric haplotypes, then a python script might be what you are after. Let me know if I can help with anything, cheers

grpiccoli commented 1 year ago

Sorry @kcleal this might not related to the original issue any longer. I suspect @wenyuhaokikika and I might be trying to do a different thing. @wenyuhaokikika, Is your ultimate goal to obtain phased fastas, one per haplotype? What type of reads do you have? What type of reference do you have? Is your reference genome already phased? Do you have an LD mapping for your haplotypes? What ploidy does the organism have?

Cheers

wenyuhaokikika commented 1 year ago

Thank you for your patient response. @grpiccoli

I've reviewed the literature and realized my mistake regarding mutations occurring in two Haplotypes. Mutations don't need to consider reverse complementarity, but rather whether they are in the same phase, as you pointed out. I greatly appreciate you pointing out this oversight, and I apologize for the confusion in my previous statement.

While obtaining consensus sequences is a part of my work, my main focus lies in cancer proteomics. I've summarized the issues I'm facing in hopes of receiving your guidance. I also aim to clarify all aspects before I proceed with coding.

My work revolves around WGS data from human cell lines, with read lengths of 126 and a sequencing depth of around 45X.
Currently, my data isn't phased, but I need to consider phased alleles.

I've researched tools for phasing mutations, including HaplotypeTools, Beagle, and SHAPEIT, among others. Ultimately, I've chosen SHAPEIT4 for phasing, based on SHAPEIT and the paper "Integrating sequence and array data to create an improved 1000 Genomes Project haplotype reference panel" (IF=14.92, date=2019).

I've run the following code:

mamba install shapeit4
bam=/data/wenyuhao/55/CCLE/raw/hct116/sorted.bam
vcf=/data/wenyuhao/55/CCLE/raw/hct116/SRR8639145_germline_filter.vcf

bgzip -c $vcf > unphased.vcf.gz
bcftools index unphased.vcf.gz

shapeit4 --input unphased.vcf.gz \
--map shapeit4/maps/chr1.b38.gmap.gz \
--region 1 \
--seed 123 \
--output phased.vcf.gz \
--sequencing \
--reference /data/wenyuhao/55/CCLE/raw/hct116/phase/reference/ALL.chr1.phase3_shapeit2_mvncall_integrated_v5b.20130502.genotypes.vcf.gz \
--thread 8 \
--log shapeit_chr1.log

Utilizing GATK-generated mutation data, comprising approximately 100,000 variants, I've only obtained 1383 phased mutations. SHAPEIT4 necessitates a reference (e.g., 1000 Genomes Project) for phasing small sample sizes, rendering it unable to phase mutations absent in the reference. I've noticed a similar issue in the shapeit4-issues68 thread. However, I'm hesitant to discard any mutation data. In this scenario, what approach would you recommend? If no alternative exists, I might need to phase all mutations as 1|1.

I appreciate your assistance and guidance in resolving this matter. Thank you in advance for your support.

grpiccoli commented 1 year ago

Hi @wenyuhaokikika thank you for your detailed answer. I am working with a longread de novo assembly of a hemizygous organism, so the structural variants are between the two copies of "the same" chromosome, this results in only 2 haplotypes in my case. I think in your case you are working with a virtually haploid reference plus a single diploid sample. Could you confirm this? If this is the case then it is possible to have multiallelic variants i.e., 0 would be your reference always, and 1 and 2 would be randomly assigned to a mixture of either one of your diploid sample two haplotypes. Separating variants of 0 from 1,2 should be easy as 0 will always represent the reference in your case, but to separate 1 from 2 you might run into the issue I am describing. Only if you need to separate 1 from 2 then you require phasing (assuming there is only one sample). In the case you do not need phasing of your sample haplotypes, but you still need to generate the two FASTA consensus then you will run into the issue that most tools like "bcftools consensus" will not take into account complex structural variants when generating the consensus. If this is your case and you have a decent genome coverage the best solution would be to do either a reference guided assembly or a de novo assembly, I checked google scholar and in cancer skim WGS they are using supernova assembler for humans, you would need to check the literature for the best assembler though. Now if you need to phase multiple samples from a population then that's a whole different issue (like with HaplotypeTools). Could you confirm if the assumptions are correct and why do you need a FASTA consensus? If all you need is to visualize your variants, then just use the vcf file from dysgu and feed it to something like Ribbon for example and that's all

kcleal / dysgu

Generating Alternative Reference #65

Methods for generating ALT sequences I used

question About INFO['END'] and INFO['SVLEN'] for INS