intervene-EU-H2020 / synthetic_data

Software program for generating synthetic datasets for genotypes and phenotypes
GNU General Public License v3.0

2 questions about the synthetic dataset available on BioStudies #32

Open alicegranb opened 3 months ago

alicegranb commented 3 months ago

Hello, thanks for this great resource!

(1) When we tried to convert the chromosome 1 PLINK files to VCF format or to apply filtering, the run failed on the size of the .bed file with the following error message: "Error: Invalid .bed file size (expected 134448552003 bytes)."

Upon reviewing the file, we noticed that the size reported at the start of the download was 134450064003 bytes, which matches the actual size of the downloaded file, so the download itself appears to be complete: "Length: 134450064003 (125G) [application/vnd.realvnc.bed]"

Our hypothesis is that the BED file is corrupt and that the information it contains does not correspond to that in the FAM and BIM files.

The conversion/filtering worked for the PLINK files of all other chromosomes.

Could you kindly assist us in resolving this matter? Could you check on your side what the true BED file size is? Any guidance or correction you could provide would be greatly appreciated.
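For reference, the size check behind the error in (1) can be sketched as follows. This is a minimal sketch, assuming the standard variant-major PLINK 1 binary layout (3 magic bytes followed by one block of ceil(n_samples / 4) bytes per variant) and assuming that the .fam and .bim files contain exactly one record per line with no blank lines. The function names and the file prefix are illustrative and do not come from the repository.

```python
import os


def count_lines(path):
    """Count lines in a text file without loading it into memory."""
    with open(path, "rb") as handle:
        return sum(1 for _ in handle)


def expected_bed_size(n_samples, n_variants):
    """Size in bytes implied by the variant-major PLINK 1 .bed layout."""
    bytes_per_variant = (n_samples + 3) // 4  # 2 bits per genotype, rounded up
    return 3 + n_variants * bytes_per_variant  # 3 magic bytes precede the data


def check_fileset(prefix):
    """Compare the actual .bed size with the size implied by .fam and .bim."""
    n_samples = count_lines(prefix + ".fam")   # one sample per line (assumed)
    n_variants = count_lines(prefix + ".bim")  # one variant per line (assumed)
    expected = expected_bed_size(n_samples, n_variants)
    actual = os.path.getsize(prefix + ".bed")
    return {
        "n_samples": n_samples,
        "n_variants": n_variants,
        "bytes_per_variant": (n_samples + 3) // 4,
        "expected": expected,
        "actual": actual,
        "difference": actual - expected,
    }


# Example call with a hypothetical prefix:
# report = check_fileset("path/to/chr1_fileset")
```

In our case the difference between the downloaded size and the expected size is 134450064003 - 134448552003 = 1512000 bytes. If that difference turns out to be a whole multiple of `bytes_per_variant`, it might indicate that the .bed file holds more variant records than the .bim file lists, rather than arbitrary corruption, but this is only a guess on our side.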

(2) Some variants have inverted REF and ALT columns, while the ID column is accurate. What could be the cause of this?

Thank you in advance for your support. Alice