mourisl / T1K

T1K is a versatile methods to genotype highly polymorphic genes (e.g. KIR, HLA) with bulk or single-cell RNA-seq, WGS or WES data.
MIT License
42 stars 7 forks source link

How to run post analysis steps such as copynumber and novel SNP detection? #11

Closed yingchen69 closed 9 months ago

yingchen69 commented 1 year ago

Hi,

The readme mentions that t1k post analysis steps can do copynumber and novel SNP detection. Is there any detail regarding how to do the tasks? There is a t1k-copynumber.py, but I am not sure if python t1k-copynumber.py -g T1K_genotyping_result_file can work.

Thanks a lot for the help!

Ying

mourisl commented 1 year ago

The novel SNP detection is automatically included in the workflow/wrapper, and the vcf file is the SNP detection results where the coordinate is the concatenated exons for each allele.

For the t1k-copynumber.py script, the "-g" option takes the the XXX_genotype.tsv file generated by T1K. You may also add the gene names to option "--nomissing" as a comma-separated list, where the genes are expected to be present on every chromosome. For example, we know there are four KIR framework genes, so the copy number inference command should be:

python3 t1k-copynumber.py -g XXX_genotype.tsv --nomissing KIR3DL3,KIR2DL4,KIR3DP1,KIR3DL2

Hope this helps and please let me know if you have further questions.

yingchen69 commented 1 year ago

Hi,

Thanks a lot for the quick reply!

I am still a bit confused by the SNP detection. Here is my command to get the genotype.tsv:

run-t1k -1 my_R1.fq.gz -2 my_R2.fq.gz --preset hla -f /hlaidx/hlaidx_dna_seq.fa --od outputdir -o mysampleid

I do not see any vcf in the output folder. Should I set more parameter to get SNP result in the output?

Best,

Ying

mourisl commented 1 year ago

Can you show me the output on the screen?

Another way to call the variants is through the "./analyzer" command, which can be run as: "./analyzer -f /hlaidx/hlaidx_dna_seq.fa -a outputdir/mysampleid_allele.tsv -1 outputdir/mysampleid_aligned_1.fa -2 outputdir/mysampleid_aligned_1.fa.gz -s 0.97 -o outputdir/mysampleid" in your case. After that you shall see the outputdir/mysampleid_allele.tsv file.

yingchen69 commented 1 year ago

Hi,

Our server is currently down for monthly maintenance. I will get the list of outputs of my T1K jobs tomorrow once it's back online.

Thanks a lot for the help!

Ying

Get Outlook for Androidhttps://aka.ms/AAb9ysg


From: Li Song @.> Sent: Wednesday, May 17, 2023 2:47:27 PM To: mourisl/T1K @.> Cc: yingchen69 @.>; Author @.> Subject: Re: [mourisl/T1K] How to run post analysis steps such as copynumber and novel SNP detection? (Issue #11)

Can you show me the output on the screen?

Another way to call the variants is through the "./analyzer" command, which can be run as: "./analyzer -f /hlaidx/hlaidx_dna_seq.fa -a outputdir/mysampleid_allele.tsv -1 outputdir/mysampleid_aligned_1.fa -2 outputdir/mysampleid_aligned_1.fa.gz -s 0.97 -o outputdir/mysampleid" in your case. After that you shall see the outputdir/mysampleid_allele.tsv file.

— Reply to this email directly, view it on GitHubhttps://github.com/mourisl/T1K/issues/11#issuecomment-1551892008, or unsubscribehttps://github.com/notifications/unsubscribe-auth/ABQ767T7VEUDMO4FTKWOWO3XGUML7ANCNFSM6AAAAAAYESQUS4. You are receiving this because you authored the thread.Message ID: @.***>

yingchen69 commented 1 year ago

Hi,

I just got our server back and I checked that we do have the allele.vcf files. Sorry for the confusion :(

I got t1k-copynumber.py working, but I am not sure about the column names. From the code it seems the column names should be gene, number of alleles, allele 1, allele 1 cn, allele 1 ratio, allele 2, allele 2 cn, allele 2 ratio. What is the 'ratio'? Is it in log2?

Another question, how does T1K deal with duplicated reads? My data are all from whole exon sequencing and I see ~20% duplicates rate by flagstat. I tried several tools for HLA typing such as polysolver, OptiType, hisat-genotype,HLAHD, HLALA. I always got totally different HLA typing results if I removed duplicated reads from the bam files first.

Thanks a lot!

Ying

mourisl commented 1 year ago

Ratio is the log-ratio of the likelihood between the most likely copy number and the second likely copy number. I'm still trying to optimize t1k-copynumber.py, so please interpret its result with caution.

We don't remove the duplicated reads. The duplicated reads will contribute to the allele abundance estimation (or other type of allele score in other HLA genotypers), therefore it is expected that the deduplication will affect the genotyping results. Hope this helps.

yingchen69 commented 1 year ago

Hi,

Thanks a lot for the details!

Regarding the duplicate reads, is their a reason to keep them for analysis? Normally we think duplicate reads are artifacts due to PCR, optical...

Best,

Ying

mourisl commented 1 year ago

In RNA-seq data, I saw some genes are very highly expressed and some reads can become duplicated by chance. I think the deduplication should be left to the user when preprocessing the fastq file if they feel the duplication has become a severe issue.