knausb / vcfR

Tools to work with variant call format files
248 stars 54 forks source link

vcfR Subsetting to the first chromosome #211

Open charbel2998 opened 1 year ago

charbel2998 commented 1 year ago

Hello, I am writing this code:

library(vcfR) library(ape)

input vcf Data S17 RES SNPs

vcf = read.vcfR("~/Desktop/Final_Table/RES_S17_readyAnalysis_SNP.vcf") genome = ape::read.dna("~/Desktop/Resource/hg38.fa", format = "fasta") ref = read.table("~/Desktop/GRCh38_latest_genomic.gff", sep = '\t', quote = "")

creating a ChromeR object

chrom <- create.chromR(name="S17SNP", vcf=vcf, seq=genome, ann=ref, verbose=TRUE) plot(chrom)

I'm getting this error that I can't explain: vcfR object includes more than one chromosome (CHROM) Subsetting to the first chromosome Warning message: In create.chromR(name = "S17SNP", vcf = vcf, seq = genome, ann = ref, : Names in variant data and annotation data do not match perfectly. If you choose to proceed, we'll do our best to match the data. But prepare yourself for unexpected results.

the fasta file is the same one I used to do the alignment and the annotation.

Does anyone have any input?

knausb commented 1 year ago

Hi @charbel2998 ,

For the following

vcfR object includes more than one chromosome (CHROM) Subsetting to the first chromosome

please see the following link.

https://knausb.github.io/vcfR_documentation/subset_data_to_1chrom.html

The chromR object is intended for a single chromosome (sequence).

For the WARNING

In create.chromR(name = "S17SNP", vcf = vcf, seq = genome, ann = ref, : Names in variant data and annotation data do not match perfectly. If you choose to proceed, we'll do our best to match the data. But prepare yourself for unexpected results.

Please see below.

https://knausb.github.io/vcfR_documentation/chromR_object.html

It has been my experience that the names do not always match. Sometimes we download them like this. Sometime a processing step we use creates the issue. For example, I believe bwa (and other aligners) tends to truncate FASTA sequence names on whitespace, resulting in sequence names you may not have expected. You can validate this by comparing the names in your FASTA (e.g., grep "^>" my_data.FASTA) with the [BS]AM header (e.g., samtools view -H my_data.BAM). This is a WARNING, meaning we're trying to make you aware of this issue. If you are sure you understand this you should be able to proceed. If you are incorrect and proceed anyway, you may be creating problems for yourself downstream. I suggest you take a minute to validate that the input order is identical and if so, proceed with caution.

Does that provide you with a solution? Brian