knausb / vcfR

Tools to work with variant call format files
248 stars 54 forks source link

issue with as.matrix for subsetting FASTA file to 1 chromosome? #155

Closed lesneski closed 4 years ago

lesneski commented 4 years ago

Hello, I have been having trouble subsetting my fasta file into 1 chromosome while following the instructions on [https://knausb.github.io/vcfR_documentation/subset_data_to_1chrom.html] When I get to the last step, I get a return of "Error in as.matrix.DNAbin(dna2) : DNA sequences in list not of the same length."

All steps prior worked fine - I also tried to run these steps with the pinf_sc50.fasta file available from the package, changing inputs where necessary, but I also couldn't get the subset fasta steps to work.

Thank you in advance.

knausb commented 4 years ago

Hello, not sure I follow you. It is usually best to create a reproducible example. Below is my attempt.

library(ape)
data(woodmouse)
woodmouse <- as.list(woodmouse)
woodmouse[[3]] <- woodmouse[[3]][1:100]
woodmouse
#> 15 DNA sequences in binary format stored in a list.
#> 
#> Mean sequence length: 907.333 
#>    Shortest sequence: 100 
#>     Longest sequence: 965 
#> 
#> Labels:
#> No305
#> No304
#> No306
#> No0906S
#> No0908S
#> No0909S
#> ...
#> 
#> Base composition:
#>     a     c     g     t 
#> 0.307 0.262 0.126 0.306 
#> (Total: 13.61 kb)

my_seq <- woodmouse[grep("No306", names(woodmouse))]
my_seq
#> 1 DNA sequence in binary format stored in a list.
#> 
#> Sequence length: 100 
#> 
#> Label:
#> No306
#> 
#> Base composition:
#>    a    c    g    t 
#> 0.35 0.30 0.07 0.28 
#> (Total: 100 bases)

Created on 2020-03-30 by the reprex package (v0.3.0)

This appears to work as expected. Can you modify this so that I can reproduce the error?

The error you're reporting appears to be from ape::as.matrix.DNAbin(). My understanding is that a DNAbin object may be a vector, matrix, or list. If it's a matrix all sequences need to be the same length. We would expect this in a multiple sequence alignment, but not in a genome where different contigs are of different sizes. Somewhere in the code this function appears to be called on a DNAbin that contains sequences of different length. Please let me know what you find!

lesneski commented 4 years ago

Hi, thanks for replying so quickly - I do indeed have sequences of different length, since I am working with a denovo transcriptome to create the VCF. I am trying to get the the point where I can make a heatmap to visualize quality scores across samples such as the heatmap.bp function. I have one sample that did not bin to population as expected and wanted to examine the quality scores somehow. Looking again through the tutorial and the format of files such as the pinf_sc50.fasta file, it looks like my data is not set up to easily be able to subset into 1 chromosome at a time (which looks like is the case for the pinf_sc50.fasta example?)

here is what the inputs look like:

dna<-read.dna("Holobiont_transcripts.fa", format="fasta")] dna 1002289 DNA sequences in binary format stored in a list.

Mean sequence length: 1531.294 Shortest sequence: 100 Longest sequence: 37688

Labels: Acer_BB1_Locus_1_Transcript_1of285068_Confidence_0.000_Lengt... Acer_BB1_Locus_1_Transcript_2of285068_Confidence_0.000_Lengt... Acer_BB1_Locus_1_Transcript_3of285068_Confidence_0.000_Lengt... Acer_BB1_Locus_1_Transcript_4of285068_Confidence_0.000_Lengt... Acer_BB1_Locus_1_Transcript_5of285068_Confidence_0.000_Lengt... Acer_BB1_Locus_1_Transcript_6of285068_Confidence_0.000_Lengt... ...

More than 10 million bases: not printing base composition (Total: 1.53 Gb) vcf<-read.vcfR("All_Snpsvcf_biallelic_minMAF05_GQ20_DP7_noNA_recode.vcf") vcf Object of Class vcfR 39 samples 130772 CHROMs 287,804 variants Object size: 374.6 Mb 0 percent missing data


lesneski commented 4 years ago

Working on this more today, I think the first pass of what I want to look at would be using the extract.info function on the vcf to pull out DP and plot that on a heatmap - however I have not figured out how to associate that numeric dataset with the contig names and sample names :

dp <- extract.info(vcf, element="DP", as.numeric = TRUE) head(dp, 100) [1] 16 15 13 13 250 2 10 10 169 27 339 157 16 94 15 124 2 2 65 17 36 54 10 [24] 10 12 77 22 94 73 21 13 10 13 79 44 39 3060 136 83 45 299 15 18 24 74 111 [47] 201 219 222 219 325 230 221 216 282 299 14 5 77 66 45 123 142 99 109 22 33 18 18 [70] 18 21 44 146 134 34 34 29 30 27 27 112 120 117 128 128 125 126 123 122 120 132 137 [93] 165 145 154 149 160 161 173 19

Using extract.gt with DP ends up with all NAs in the matrix:

gt<-extract.gt(vcf, element="DP", as.numeric=TRUE) head(gt) B11.sorted.bam B12.sorted.bam B2.sorted.bam Acer_BB1_Locus_1_Transcript_2of285068_Confidence_0.000_Length_1696_266 NA NA NA Acer_BB1_Locus_1_Transcript_5of285068_Confidence_0.000_Length_897_598 NA NA NA Acer_BB1_Locus_1_Transcript_5of285068_Confidence_0.000_Length_897_600 NA NA NA Acer_BB1_Locus_1_Transcript_5of285068_Confidence_0.000_Length_897_601 NA NA NA Acer_BB1_Locus_1_Transcript_19of285068_Confidence_0.000_Length_595_429 NA NA NA Acer_BB1_Locus_1_Transcript_21of285068_Confidence_0.000_Length_293_79 NA NA NA B22.sorted.bam B_B1.sorted.bam B_B16.sorted.bam Acer_BB1_Locus_1_Transcript_2of285068_Confidence_0.000_Length_1696_266 NA NA NA Acer_BB1_Locus_1_Transcript_5of285068_Confidence_0.000_Length_897_598 NA NA NA Acer_BB1_Locus_1_Transcript_5of285068_Confidence_0.000_Length_897_600 NA NA NA Acer_BB1_Locus_1_Transcript_5of285068_Confidence_0.000_Length_897_601 NA NA NA Acer_BB1_Locus_1_Transcript_19of285068_Confidence_0.000_Length_595_429 NA NA NA Acer_BB1_Locus_1_Transcript_21of285068_Confidence_0.000_Length_293_79 NA NA NA

knausb commented 4 years ago

Hi, if your goal is to explore the quality of your variants, then I don't see why you need FASTA data. The below links demonstrate how I typically approach this.

https://knausb.github.io/vcfR_documentation/sequence_coverage.html

https://knausb.github.io/vcfR_documentation/depth_plot.html

Note that different variant callers produce VCF files that contain different information. The function queryMETA() is intended to help you explore what sort of information is included in your VCF file. If your data does not include "DP" then that could be why you get all NAs. Also note that a VCF file is just a text file, so you can open it with any text browser, at least ones that can handle large files. Thank you for reporting your progress, please continue to!

lesneski commented 4 years ago

Hi again, Thanks for the links. I am wondering if its possible that there DP is missing from my vcf in some form. A screenshot shows that DP is under the "INFO" column:

Screen Shot 2020-03-31 at 7 01 59 PM

And doing query(META) gives

queryMETA(vcf, element = 'DP') [[1]] [1] "INFO=ID=DP" "Number=1" "Type=Integer"
[4] "Description=Raw read depth"

[[2]] [1] "INFO=ID=DP4"
[2] "Number=4"
[3] "Type=Integer"
[4] "Description=Number of high-quality ref-forward " [5] " ref-reverse"
[6] " alt-forward and alt-reverse bases"

however, head(vcf) shows that DP is not listed under unique GT formats!

head(vcf) [1] " Object of class 'vcfR' " [1] " Meta section " [1] "##fileformat=VCFv4.2" [1] "##FILTER=<ID=PASS,Description=\"All filters passed\">" [1] "##samtoolsVersion=1.10+htslib-1.10" [1] "##samtoolsCommand=samtools mpileup -gf Holobiont_transcripts.fa -o A [Truncated]" [1] "##reference=file://Holobiont_transcripts.fa" [1] "##contig=<ID=Acer_BB1_Locus_1_Transcript_1of285068_Confidence0.000 [Truncated]" [1] "First 6 rows." [1] [1] " Fixed section " CHROM POS ID REF ALT [1,] "Acer_BB1_Locus_1_Transcript_2of285068_Confidence_0.000_Length_1696" "266" NA "A" "G" [2,] "Acer_BB1_Locus_1_Transcript_5of285068_Confidence_0.000_Length_897" "598" NA "T" "G" [3,] "Acer_BB1_Locus_1_Transcript_5of285068_Confidence_0.000_Length_897" "600" NA "A" "T" [4,] "Acer_BB1_Locus_1_Transcript_5of285068_Confidence_0.000_Length_897" "601" NA "T" "A" [5,] "Acer_BB1_Locus_1_Transcript_19of285068_Confidence_0.000_Length_595" "429" NA "G" "C" [6,] "Acer_BB1_Locus_1_Transcript_21of285068_Confidence_0.000_Length_293" "79" NA "G" "C" QUAL FILTER [1,] "4.49432" NA
[2,] "32.6846" NA
[3,] "31.6852" NA
[4,] "31.1472" NA
[5,] "27.9192" NA
[6,] "11.3579" NA
[1] [1] " Genotype section " FORMAT B11.sorted.bam B12.sorted.bam B2.sorted.bam B22.sorted.bam B_B1.sorted.bam [1,] "GT:PL" "0/1:0,0,0" "1/1:4,3,0" "0/1:0,0,0" "0/1:0,3,4" "0/1:0,0,0"
[2,] "GT:PL" "0/1:0,0,0" "0/1:0,0,0" "0/0:0,3,7" "0/1:0,0,0" "0/1:0,0,0"
[3,] "GT:PL" "0/1:0,0,0" "0/1:0,0,0" "0/0:0,6,9" "0/1:0,0,0" "0/1:0,0,0"
[4,] "GT:PL" "0/1:0,0,0" "0/1:0,0,0" "0/0:0,6,9" "0/1:0,0,0" "0/1:0,0,0"
[5,] "GT:PL" "1/1:4,3,0" "1/1:12,12,0" "0/1:0,5,7" "1/1:12,15,0" "1/1:4,3,0"
[6,] "GT:PL" "0/1:0,0,0" "0/1:0,0,0" "0/1:0,0,0" "0/1:0,0,0" "0/1:0,0,0"
[1] "First 6 columns only." [1] [1] "Unique GT formats:" [1] "GT:PL" [1]

So maybe I cannot do what I would like to with this VCF and the current formats available!

knausb commented 4 years ago

My understanding is that the INFO column is a summary over all samples. I don't look at it these days. And if you're trying to identify particular samples with quality issues I do not think this information will help you. I would not say that "DP" is "missing" from your data, it's just not present. You can have a valid VCF file that does not contain this information and yours does not appear to include "DP". You have just the genotype (GT) and the phred-scaled likelihood of each genotype (PL). I'm afraid I think you need to go back and call variants again. I know that's a big step but if you want more data-rich VCF files I think that's your only option. I do not use SAMtools for variant calling these days. But I remember it providing more information. You might be able to parameterize it to produce more information? I currently use GATK, and it produces information rich data. Recalling variants with GATK would be my recommendation.

lesneski commented 4 years ago

Ah ok, I will call again with GATK, thank you for your patience and help!