hartwigmedical / hmftools

Various algorithms for analysing genomics data
GNU General Public License v3.0
181 stars 56 forks source link

Lilac on WES data #415

Closed AAAapollo closed 1 year ago

AAAapollo commented 1 year ago

Dear developer,

I am running Lilac(v1.4.2) on >200x depth WES data, and some samples error are shown below.

And the genome version that we used for alignment is hg19, and there are some ALTs of chr6 (like chr6_apd_hap1, chr6_cox_hap2, ...). The parameter is mostly the same with the parameter in HMFtools pipeline, but is added with "-fatal_low_coverage 10000". And many samples get different HLA typing with previous polysolver typing result.

So my question is if Lilac can work on WES data, and how can I fixed the bugs as below and get consistent result with polysolver.

Looking forward to your reply, thanks!

[INFO ] Lilac version(1.4.2), key parameters: [INFO ] sample(P68) inputs: tumorBam(false) somaticVCF(true) geneCopyNumber(true) rnaBam(false) [INFO ] minBaseQual(30), minEvidence(2) minFragmentsPerAllele(7) minFragmentsToRemoveSingle(40) maxDistanceFromTopScore(0.005) [INFO ] reading nucleotide file: hla_ref_nucleotide_sequences.csv [INFO ] loaded 17542 sequences from file hla_ref_nucleotide_sequences.csv [INFO ] reading protein file: hla_ref_aminoacid_sequences.csv [INFO ] loaded 13198 sequences from file hla_ref_aminoacid_sequences.csv [INFO ] loaded 128 common alleles [INFO ] querying records from reference bam P68.bam [INFO ] lowering min base quality(30) to median(28) [INFO ] totalFrags(911) minEvidence(2.0) minHighQualEvidence(0.3) [INFO ] gene(HLA-A) determining un-phased candidates from frags(896) [WARN ] no candidates after amino acid filtering - reverting to common allele gene candidates [INFO ] gene(HLA-B) determining un-phased candidates from frags(853) [WARN ] no candidates after amino acid filtering - reverting to common allele gene candidates [INFO ] gene(HLA-C) determining un-phased candidates from frags(787) [WARN ] no candidates after amino acid filtering - reverting to common allele gene candidates [INFO ] gene(HLA-A) has 0 candidates after phasing: [INFO ] gene(HLA-B) has 0 candidates after phasing: [INFO ] gene(HLA-C) has 5 candidates after phasing: C01:02, C04:01, C04:09N, C04:82, C14:02 [INFO ] recovering common alleles: A01:01, A01:02, A02:01, A02:02, A02:03, A02:05, A02:06, A02:07, A02:20, A03:01, A03:02, A11:01, A23:01, A24:02, A24:03, A24:07, A25:01, A26:01, A26:08, A29:01, A29:02, A30:01, A30:02, A30:04, A31:01, A32:01, A33:01, A33:03, A34:01, A34:02, A66:01,A68:01, A68:02, A69:01, A74:01, B07:02, B07:05, B07:06, B08:01, B13:01, B13:02, B14:01, B14:02, B15:01, B15:02, B15:03, B15:07, B15:13, B15:17, B15:18, B15:21, B18:01, B27:02, B27:05, B27:06, B27:07, B35:01, B35:02, B35:03, B35:05, B35:08, B37:01, B38:01, B38:02, B39:01, B39:06, B40:01, B40:02, B40:06, B41:01, B41:02, B42:01, B44:02, B44:03, B44:04, B44:05, B44:27, B45:01, B46:01, B47:01, B48:01, B49:01, B50:01, B50:02, B51:01, B51:07, B51:08, B52:01, B53:01, B55:01, B56:01, B57:01, B57:03, B58:01, B73:01, B81:01, C02:02, C02:10, C03:02, C03:03, C03:04, C04:03, C05:01, C06:02, C07:01, C07:02, C07:04, C07:06, C07:18, C08:01, C08:02, C12:02, C12:03, C12:143, C15:02, C15:05, C15:06, C16:01, C16:02, C16:04, C17:01, C17:03, C18:01 [INFO ] building frag-alleles from aminoAcids(frags=911 candSeq=128) nucFrags(hetLoci=3 candSeq=3717 nucs=1101) knownIndels(0) [INFO ] HLA-Y fragments(182 unique=76) shared=106) above threshold [INFO ] filtering candidates from fragAlleles(527) candidates(128) recovered(123) [INFO ] confirmed 1 unique groups: C01[103,72,31,0] [INFO ] found 1 insufficiently unique groups: A33[18,3,15,0] [INFO ] keeping 2 recovered alleles from unique groups: A33:01, A33:03 [INFO ] confirmed 1 unique proteins: C01:02[152,126,26,0] [INFO ] found 2 insufficiently unique proteins: A33:03[54,4,50,0], C*14:02[31,4,27,0] [INFO ] building frag-alleles from aminoAcids(frags=911 candSeq=7) nucFrags(hetLoci=3 candSeq=304 nucs=1101) knownIndels(0) [INFO ] calculating coverage for complexes(0) and ref alleles(392) [FATAL] failed to calculate complex coverage

p-priestley commented 1 year ago

There is no fundamental problem with WES data. The log above though suggests you have poor or very uneven coverage of the HLA-A/HLA-B/HLA-C genes. Are you perhaps using a hg38_alt genome? LILAC does not currently support gathering of ALT alignments from the HLA ALT contigs. Please note this comment on our readme:

LILAC supports GRCH37, hg19 and hg38 (no alt) ref genomes. Note that if the genome has been aligned to an hg38 genome with HLA alt contigs, that fragments from these contigs will need to be extracted and realigned to the no alt genome prior to running LILAC.

If you are not using a reference with ALT contigs, then it would be good to visually check the coverage of each of the genes.

AAAapollo commented 1 year ago

There is no fundamental problem with WES data. The log above though suggests you have poor or very uneven coverage of the HLA-A/HLA-B/HLA-C genes. Are you perhaps using a hg38_alt genome? LILAC does not currently support gathering of ALT alignments from the HLA ALT contigs. Please note this comment on our readme:

LILAC supports GRCH37, hg19 and hg38 (no alt) ref genomes. Note that if the genome has been aligned to an hg38 genome with HLA alt contigs, that fragments from these contigs will need to be extracted and realigned to the no alt genome prior to running LILAC.

If you are not using a reference with ALT contigs, then it would be good to visually check the coverage of each of the genes.

Thanks. We used hg19 genome version, but with ALT contigs like chr6_apd_hap1, chr6_cox_hap2 and so on. Will this effect the coverage of HLA genes?

p-priestley commented 1 year ago

I don't know about that ref genome specifically, but only HLA related ALT contigs should cause problems, so it is unlikely

This line suggests that in total only 911 fragments were found overlapping the HLA gene exons which seems very low for exome data:

[INFO ] totalFrags(911) minEvidence(2.0) minHighQualEvidence(0.3)

Did you get a qc file produced. If so, perhaps you could share that?

I think I would need to look at the data directly to comment more on the specific issue. You can email me at p.priestley@hartwigmedicalfoundation.nl if you want to discuss sharing that for one patient to diagnose the problem

AAAapollo commented 1 year ago

I don't know about that ref genome specifically, but only HLA related ALT contigs should cause problems, so it is unlikely

This line suggests that in total only 911 fragments were found overlapping the HLA gene exons which seems very low for exome data:

[INFO ] totalFrags(911) minEvidence(2.0) minHighQualEvidence(0.3)

Did you get a qc file produced. If so, perhaps you could share that?

I think I would need to look at the data directly to comment more on the specific issue. You can email me at p.priestley@hartwigmedicalfoundation.nl if you want to discuss sharing that for one patient to diagnose the problem

Actually, lilac failed on this patient, so there was no any file output. And is there any solution for lilac to deal with such genome within ALT contigs?

p-priestley commented 1 year ago

I am not sure the problem is the reference genome. It might just be that certain exons have very poor coverage in your exome panel. I cannot tell without looking at some example data