starskyzheng / panpop

Application of pan-genome for population
MIT License

A significant amount of variant loci information on some chromosomes is missing after merging #47

Closed: polchan closed this issue 6 months ago

polchan commented 6 months ago

Dr. Zheng,

Hello! Happy holidays!

I previously used PART_run.pl (starskyzheng-patch-2) to merge structural variant (SV) data from second-generation population sequencing data generated by VG giraffe. Recently, when using the merged data, I noticed that the variant loci on some chromosomes are missing. Specifically, I constructed population data for Malus, which has 17 chromosomes, but after merging, the variant loci on chromosomes 12-17 were missing. I was concerned that I had made an error in my previous run, so I recently reran the process. The problem persists, and this time the variant loci on chromosomes 8-17 have disappeared from the results. I apologize for disturbing you during the holiday, but if you have time, could you please help me clarify this issue?

Best regards,

Bocheng

polchan commented 6 months ago

I carefully checked the run log and found that the problem might have occurred during the "reading ref fasta" step, as not all sequences were read in. When executing bcftools sort OUTDIR_RUN1/2.thin1.unsorted.vcf.gz -o OUTDIR_RUN1/2.thin1.sorted.vcf.gz --temp-dir tmp/tmp_bcftools, a warning appeared: [W::bgzf_read_block] EOF marker is absent. The input may be truncated. This could be due to the file being too large, which may have led to the error. I am now preparing to try splitting the VCF and running it again.
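The warning quoted above refers to the BGZF end-of-file marker, a fixed 28-byte empty block that bgzip-style writers append to a finished file. As a minimal sketch of what that check amounts to (the function name `has_bgzf_eof` is made up for illustration and is not part of panpop or bcftools), one can compare the last 28 bytes of the file against the marker:

```python
import os

# The 28-byte empty BGZF block written at the end of a complete BGZF file.
BGZF_EOF = bytes.fromhex(
    "1f8b08040000000000ff"   # gzip magic, deflate, FEXTRA flag, mtime, XFL, OS
    "0600424302001b00"       # extra field: subfield "BC", block size - 1 = 27
    "03000000000000000000"   # empty deflate block, CRC32 = 0, ISIZE = 0
)


def has_bgzf_eof(path):
    """Return True if the file ends with the BGZF end-of-file marker.

    Hypothetical helper: a missing marker usually means the writer was
    interrupted (crash, full disk, killed job) before it finished.
    """
    if os.path.getsize(path) < len(BGZF_EOF):
        return False
    with open(path, "rb") as fh:
        fh.seek(-len(BGZF_EOF), os.SEEK_END)
        return fh.read() == BGZF_EOF
```

Calling `has_bgzf_eof("OUTDIR_RUN1/2.thin1.unsorted.vcf.gz")` would return False for a file that triggers this warning. A missing marker only says that writing stopped early; it does not say why.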

starskyzheng commented 6 months ago

You may need to check the file OUTDIR_RUN1/2.thin1.unsorted.vcf.gz with gzip -t xx.vcf.gz.
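For environments where it is easier to script the check, a rough Python equivalent of gzip -t is to decompress the whole file and see whether the stream ends cleanly. This is only a sketch; `gzip_is_intact` is a hypothetical helper name, and it returns False for any read problem, including a missing file.

```python
import gzip
import zlib


def gzip_is_intact(path, chunk_size=1 << 20):
    """Decompress the whole file and report whether it ends cleanly.

    Hypothetical helper, roughly comparable to `gzip -t`. A truncated
    stream makes Python's gzip module raise EOFError; corrupt data
    raises zlib.error or an OSError subclass such as BadGzipFile.
    """
    try:
        with gzip.open(path, "rb") as fh:
            while fh.read(chunk_size):
                pass  # discard the data; only the absence of errors matters
    except (EOFError, OSError, zlib.error):
        return False
    return True
```

One caveat: a bgzip-compressed file is a series of concatenated gzip members, so a file cut exactly at a member boundary can still pass this kind of test. A clean result therefore does not prove that every record was written.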

polchan commented 6 months ago

Yes, I checked 2.thin1.unsorted.vcf.gz using grep and found that it is missing some chromosomes.
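The same check can be made a little more systematic by counting records per chromosome and comparing them with the contigs declared in the VCF header. The sketch below assumes the header contains `##contig=<ID=...>` lines, which not every VCF does; `records_per_contig` is a hypothetical name, not a panpop function.

```python
import gzip
import re
from collections import Counter

# Matches header lines such as: ##contig=<ID=chr01,length=12345>
CONTIG_RE = re.compile(r"^##contig=<ID=([^,>]+)")


def records_per_contig(path):
    """Count records per CHROM and list declared contigs with no records.

    Hypothetical helper. Returns (counts, missing), where `missing` is only
    meaningful if the header declares its contigs.
    """
    declared = []
    counts = Counter()
    with gzip.open(path, "rt") as fh:
        for line in fh:
            if line.startswith("##"):
                match = CONTIG_RE.match(line)
                if match:
                    declared.append(match.group(1))
            elif line.strip() and not line.startswith("#"):
                # CHROM is the first tab-separated column of a record.
                counts[line.split("\t", 1)[0]] += 1
    missing = [name for name in declared if counts[name] == 0]
    return counts, missing
```

On a truncated file the loop will raise EOFError partway through, which is itself a sign that the file is incomplete.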

starskyzheng commented 6 months ago

I see. The log shows that 1.realign0.unsort.vcf.gz seems to be broken. Maybe you could also check this file.

polchan commented 6 months ago

Sorry! I deleted the previous results. Now that I process each chromosome separately, I no longer encounter the issue. I think the problem has been resolved. It might have been caused by the VCF file being too large. Thank you for your response!
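The thread does not show how the VCF was split per chromosome. As one possible sketch in plain Python, the input can be streamed once and each record written to a file named after its CHROM value. The function name `split_vcf_by_chrom` and the output naming are assumptions made for illustration. The output is ordinary gzip, not BGZF, so it would need to be recompressed with bgzip before indexing.

```python
import gzip
import os


def split_vcf_by_chrom(path, out_dir):
    """Write one <CHROM>.vcf.gz per chromosome, each with the full header.

    Hypothetical helper. Assumes chromosome names are safe to use as file
    names and that the header precedes all records, as in a valid VCF.
    """
    os.makedirs(out_dir, exist_ok=True)
    header = []
    handles = {}
    try:
        with gzip.open(path, "rt") as src:
            for line in src:
                if line.startswith("#"):
                    header.append(line)
                    continue
                if not line.strip():
                    continue
                chrom = line.split("\t", 1)[0]
                out = handles.get(chrom)
                if out is None:
                    # First record for this chromosome: open file, copy header.
                    out_path = os.path.join(out_dir, f"{chrom}.vcf.gz")
                    out = gzip.open(out_path, "wt")
                    out.writelines(header)
                    handles[chrom] = out
                out.write(line)
    finally:
        for out in handles.values():
            out.close()
    return sorted(handles)
```

On an indexed, bgzip-compressed input, selecting a region with bcftools view -r is the more usual route. The sketch above is mainly useful when the input cannot be indexed yet.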