mehrdadbakhtiari / adVNTR

A tool for genotyping Variable Number Tandem Repeats (VNTR) from sequence data
http://advntr.readthedocs.io/
BSD 3-Clause "New" or "Revised" License
39 stars 15 forks source link

Help understanding outputs #47

Closed hchetia closed 3 years ago

hchetia commented 3 years ago

Hi, I have used the foll. command to run advntr on my PCR-free WGS data from Illumina-

advntr genotype --threads 4 --expansion --coverage 5 --alignment_file mybam.sorted.bam --models /advntr_hg38_selected_VNTRs_Illumina/hg38_selected_VNTRs_Illumina.db --working_directory .

I get the foll. results-

unmapped.fasta (476257 reads) keywords.txt (0 b) filtering_out.txt (0 b) log (shown below) image

My questions- Why only the unmapped reads were used by the tool? How do I generate vcf and bed format output? Can a tab delimited file with predicted GT per VNTR be generated? Is there a way to visualize the reads leading to genotypes- static like REVEIWER or interactive like IGV?

Best, Hasna

mehrdadbakhtiari commented 3 years ago

Hi,

  1. All the reads are generally used to make calls. Let me know if you suspect any issue and I can look further.
  2. The easiest way to see the output is to add --outfile filename.out to the end of the command. To generate different formats for output, you can use one of the --outfmt vcf --outfile output.vcf or --outfmt bed --outfile output.bed at the end of the command to get desired output format.
  3. I believe the bed output format (--outfmt bed --outfile output.bed) generates this.
  4. I believe @Jong-hun-Park has more information about visualization of the reads.

Best, Mehrdad

hchetia commented 3 years ago

Hi @mehrdadbakhtiari Thanks for the reply. Is there a way to generate both vcf and bed at one go? There is no option called --outfile.

mehrdadbakhtiari commented 3 years ago

there isn't a way to generate both vcf and bed in one go. But I believe one can be converted to another. Which version are you using? is --outfile present in the output of advntr genotype --help?

hchetia commented 3 years ago

Hi, Okay, that would solve my problem. However, I am not seeing --outfile in v 1.2.0 image

hchetia commented 3 years ago

Hi @mehrdadbakhtiari

  1. Is there a way to install advntr 1.4.1 via conda?
  2. I reran the program with the new params but I am not getting any outputs other than unmapped.fasta. What could be the problem? The command prompt log is shown below image Advntr run log- image Head of my bam file is shown below (will presence of RG: tag impact the tool somehow?) A00815:188:H55YWDRXY:2:2104:4698:4335 147 chr1 9998 13 167S81M = 10046 -33 CCTAGCCCTACACCTCACCCGCCCCCTAACCCTCACGATATCCATCACCCTATACCTTTACCTATCCCTTTCACTATACTTTAATTTATCGCTTTCCTTCTAGCTATACAGATATGTATCTAAATAGCTATCGTTAACGGTATATCTATACCTATAGCTTTCTCTATCGATAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCC F,,,,,,,,:,,,,,,,:,:,,,,,,,,,,,,,,,,,,,,,,,,,::,,:,,,,,,,,,,,,,,,,,,:,,:,,,,,,,,,:,,,,,:F,,,,,:,,,,,,,,,,:,,,,,,F,,,,,,:,,,,,,,,F,::,,:,,F,,:,,,,,FF::,FFF:,,,F,:,,,F,,,,,:,,,,FFFF:,:FF,F,FFF,:,FFF::F:FF,:,:FF:,FFFF,:FFF:F:FFF:FFFFFF,FFFFF:F:,F::,F, NM:i:0 MD:Z:81 MC:Z:65M184S AS:i:81 XS:i:79 RG:Z:fixed_63D2_S4_L00C_val XA:Z:chr22,+50808391,79M169S,0;chr7,-10002,170S78M,0;chr22,+50807943,78M170S,0;chr5,-11309,170S78M,0;chr19,+58607540,77M171S,0; A00815:188:H55YWDRXY:1:2113:4544:13808 2193 chr1 9999 0 176H74M chr2 32916242 0 GATAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCC ,,,,F,,,:F,,F,F:F::FFFF:FFFFFFF::FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF NM:i:0 MD:Z:74 MC:Z:160H90M AS:i:74 XS:i:73 RG:Z:fixed_63D2_S4_L00C_val SA:Z:chr2,32916253,+,167S83M,13,1; XA:Z:chr22,+50808397,73M177S,0;
mehrdadbakhtiari commented 3 years ago

Yes 1.4.1 is available in coda. Have you tried conda install -c bioconda advntr or conda update advntr or conda install advntr=1.4.1?

hchetia commented 3 years ago

Yes 1.4.1 is available in coda. Have you tried conda install -c bioconda advntr or conda update advntr or conda install advntr=1.4.1?

I haven't. Let me try that.

hchetia commented 3 years ago

Which python V does 1.4.1 need?

hchetia commented 3 years ago

I installed 1.4.1 and getting the foll warnings image

hchetia commented 3 years ago

HI @mehrdadbakhtiari I installed 1.4.1 and ran it on my bam file despite the warnings shown above. The generated bed file was empty. Other outputs with data were sorted.unmapped.fasta, keywords.sorted.unmapped.fasta.txt and filtering_out_sorted.unmapped.fasta.txt.

Log of the run-

[M::bam2fq_mainloop] discarded 0 singletons [M::bam2fq_mainloop] processed 476257 reads Using TensorFlow backend. /scratch/hchetia/miniconda3/envs/hchetia_py36/lib/python3.6/site-packages/tensorflow/python/framework/dtypes.py:526: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'. _np_qint8 = np.dtype([("qint8", np.int8, 1)]) /scratch/hchetia/miniconda3/envs/hchetia_py36/lib/python3.6/site-packages/tensorflow/python/framework/dtypes.py:527: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'. _np_quint8 = np.dtype([("quint8", np.uint8, 1)]) /scratch/hchetia/miniconda3/envs/hchetia_py36/lib/python3.6/site-packages/tensorflow/python/framework/dtypes.py:528: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'. _np_qint16 = np.dtype([("qint16", np.int16, 1)]) /scratch/hchetia/miniconda3/envs/hchetia_py36/lib/python3.6/site-packages/tensorflow/python/framework/dtypes.py:529: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'. _np_quint16 = np.dtype([("quint16", np.uint16, 1)]) /scratch/hchetia/miniconda3/envs/hchetia_py36/lib/python3.6/site-packages/tensorflow/python/framework/dtypes.py:530: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'. _np_qint32 = np.dtype([("qint32", np.int32, 1)]) /scratch/hchetia/miniconda3/envs/hchetia_py36/lib/python3.6/site-packages/tensorflow/python/framework/dtypes.py:535: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'. np_resource = np.dtype([("resource", np.ubyte, 1)]) Traceback (most recent call last): File "/scratch/hchetia/miniconda3/envs/hchetia_py36/bin/advntr", line 11, in <module> sys.exit(main()) File "/scratch/hchetia/miniconda3/envs/hchetia_py36/lib/python3.6/site-packages/advntr/__main__.py", line 134, in main genotype(args, genotype_parser) File "/scratch/hchetia/miniconda3/envs/hchetia_py36/lib/python3.6/site-packages/advntr/advntr_commands.py", line 124, in genotype genome_analyzier.find_repeat_counts_from_alignment_file(input_file, average_coverage, args.update) File "/scratch/hchetia/miniconda3/envs/hchetia_py36/lib/python3.6/site-packages/advntr/genome_analyzer.py", line 241, in find_repeat_counts_from_alignment_file average_coverage, update) File "/scratch/hchetia/miniconda3/envs/hchetia_py36/lib/python3.6/site-packages/advntr/profiler.py", line 8, in wrapper retval = func(*args, **kwargs) File "/scratch/hchetia/miniconda3/envs/hchetia_py36/lib/python3.6/site-packages/advntr/vntr_finder.py", line 750, in find_repeat_count_from_alignment_file max_prob=0) # No probability for the estimated genotype TypeError: __init__() got an unexpected keyword argument 'max_prob' (END)

ADVNTR LOG is attached herein. Looking it the log, I can see some VNTRs mapped with reads.\, but none of them showed up in my bed file.

I have successfully predicted genotypes of STRs from my bam file using Expansion hunter, Gangstr etc. and I can't think of a reason why this wouldn't work.

Please help. log.bam.log

Best Hasna

Jong-hun-Park commented 3 years ago

Hi, @hchetia

Can you please let me know your command?

Thanks, Jonghun

hchetia commented 3 years ago

nohup advntr genotype --threads 4 --expansion --coverage 5 --alignment_file sorted.bam --models advntr_hg38_selected_VNTRs_Illumina/hg38_selected_VNTRs_Illumina.db --working_directory . --outfmt bed

Jong-hun-Park commented 3 years ago

Thanks! It seems that you are trying to identify expansions in VNTRs with Illumina reads. I recommend that you rather use genotype mode without --expansion and --coverage flags unless you are specifically interested in very long expansion because the VNTRs trained by the reference model (hg38_selected_VNTRs_Illumina.db) are designed for Illumina reads.

hchetia commented 3 years ago

Hi @Jong-hun-Park COuld you please specify how long is "very long" expansion? My samples' fragment sizes are around 400-450 and I am hoping to capture at least >300 bases of expansion.

Jong-hun-Park commented 3 years ago

Thanks for the information. If you are to identify such expansions with Illumina reads, I would recommend using other tools specifically designed for expansions such as expansion hunter. Although adVNTR offers a coverage-based method, adVNTR is not fully tested on such expansion cases with the method and is more suitable for genotype VNTRs that can be spanned by reads.

hchetia commented 3 years ago

Hi @Jong-hun-Park Thanks for the explanation. I think adVNTR could be a good second evidence for some VNTRs. Did you observe a threshold of the maximum allele size detected by adVNTR for any one VNTR?

Jong-hun-Park commented 3 years ago

Without the coverage-based method, adVNTR is able to genotype VNTR alleles longer than read length, but the accuracy drops as the allele size is increased as you can see in the figure below. You can find the figure in the supplementary data in this paper (adVNTR-NN paper) image

Please let me know if you have further questions.

hchetia commented 3 years ago

The % of accuracy starts dropping sharply at around >100 bases. Thanks for explaining this.

hchetia commented 3 years ago

Thanks! It seems that you are trying to identify expansions in VNTRs with Illumina reads. I recommend that you rather use genotype mode without --expansion and --coverage flags unless you are specifically interested in very long expansion because the VNTRs trained by the reference model (hg38_selected_VNTRs_Illumina.db) are designed for Illumina reads.

With respect to this comment, I actually finished running the program with modified command. nohup advntr genotype --threads 4 --alignment_file sorted.bam --models /advntr_hg38_selected_VNTRs_Illumina/hg38_selected_VNTRs_Illumina.db --working_directory . --outfmt bed

The outputs included a log file (attached), unmapped.fasta, keywords.unmapped.fasta.text and filtering_out.unmapped.fasta.text. No .bed file was generated.

Jong-hun-Park commented 3 years ago

Oh, please add -o option (e.g. -o outfile.bed) to write the results to a file. Otherwise, it will output to stdout.