How to work with multiple vcf files?

zh-zhang1984 commented 3 years ago

Hi, I can work with one file using vcfR, but I have more than 100 .vcf files, each representing one sample, and the samples are biological replicates. How can I perform downstream analysis, possibly by merging these files together?

knausb commented 3 years ago

Hi, @zh-zhang1984, in theory you could cbind() the GT slots, but I'm going to recommend against that. Remember that VCF data only includes the variable positions. Because different samples are likely to have different variable positions, the rows will not line up. For example, line 200 for sampleA may be CHROM 1 and POS 2533 while line 200 for sampleB may be CHROM 2 and POS 125. You would have to sort this out, and it sounds like a lot of work. Even if you did sort it out, you would have a lot of holes in your data matrix. I think the most appropriate way to deal with those would be to insert NAs, but that will present a challenge for analysis. You could instead insert genotypes that are homozygous for the reference, but then you would be calling genotypes in the absence of any data. If you re-call the variants jointly over all of the samples you wish to analyze, the variant caller should insert homozygous-reference genotypes where it has data and NAs where it does not. My recommendation is that you re-call your variants. Good luck!
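
To make the row-alignment problem concrete, here is a minimal base R sketch. It does not use vcfR itself. The two data frames and their genotype values are invented for illustration and stand in for the per-sample position and genotype tables you would pull out of each file. It only shows why a positional cbind() pairs unrelated sites and how many NAs a key-based merge leaves behind.

```r
# Toy per-sample genotype tables. All values are invented for illustration;
# real tables would come from your own per-sample VCF files.
sampleA <- data.frame(CHROM = c("1", "1", "2"),
                      POS = c(2533L, 4100L, 125L),
                      sampleA = c("0/1", "1/1", "0/1"),
                      stringsAsFactors = FALSE)
sampleB <- data.frame(CHROM = c("1", "2", "2"),
                      POS = c(4100L, 125L, 900L),
                      sampleB = c("0/1", "1/1", "0/1"),
                      stringsAsFactors = FALSE)

# Row i of sampleA and row i of sampleB describe different sites here,
# so cbind() would silently pair unrelated variants.
same_site <- sampleA$CHROM == sampleB$CHROM & sampleA$POS == sampleB$POS

# A full outer join on CHROM and POS aligns the sites and leaves NA
# wherever a sample has no record at that site.
merged <- merge(sampleA, sampleB, by = c("CHROM", "POS"), all = TRUE)
merged <- merged[order(merged$CHROM, merged$POS), ]

# Count the holes in the merged genotype matrix.
n_missing <- sum(is.na(merged[, c("sampleA", "sampleB")]))
print(merged)
```

Each NA in `merged` could mean either "no data" or "homozygous for the reference", and the merge cannot tell the two apart. That ambiguity grows with more than 100 samples, which is why re-calling the variants jointly is the recommended route.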

zh-zhang1984 commented 3 years ago

Thank you very much for your comments. They are very helpful.



knausb / vcfR

How to work with multiple vcf files? #177