GEN-reader? - GitHub Issues

hoangthienan95 commented 5 years ago

Thanks for this great package! I tried a lot of tools to read bgen files from UK Biobank (including Hail), and yours just works beautifully with Dask. It saved me a lot of work!

Do you have a version of this but for GEN files? The Dask integration is what I'm looking for.

horta commented 5 years ago

Thank you, @hoangthienan95 =) Do you have an example of a GEN file?

horta commented 5 years ago

Btw, we have a reader for GEN files implemented in limix: https://github.com/limix/limix/blob/master/limix/io/gen.py

I'm just not sure whether it covers all cases of GEN files.

hoangthienan95 commented 5 years ago

Hi @horta I have a ".txt" file with the format below:

Rsid_info Position Ref_allele Alt_allele Triplicate_prob1 Triplicate_prob2 Triplicate_prob3
rs145615430:56:C:T 56 C T 1 0 0

Since the genetic data has 127,717 subjects, it has about 400k columns, which makes it hard to read using Dask directly (see https://github.com/dask/dask/issues/5365). Having the genotype as a Dask delayed array, like you have with the bgen reader, with each element being the matrix for all subjects for that genotype, would be perfect. Do you have any recommendations on what to do or what package to use?

horta commented 5 years ago

Dask would be my best bet. It seems from the other thread that you are using a space as the separator, while your file has | as a separator. Isn't that a problem? Also, the Triplicate_prob column seems to encompass three columns, which might confuse dask/pandas.

Another point of concern is that CSV files (like GEN files) might have rows of different sizes, which means that you cannot jump to, let's say, the 10th row without reading every single byte of the previous rows. This drawback is often handled by creating an accompanying index file for that CSV file. Such an index file stores the position (in bytes) of the start of every row, so that you can readily jump to any row you want without visiting every single byte again.

(Of course, the tool that creates such an index file would need to read every single byte at first).
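To make the index idea concrete, here is a minimal sketch using only the Python standard library. The function names `build_line_index` and `read_line` and the three sample rows are made up for illustration; they are not part of bgen-reader or limix. One full pass records the byte offset of every row, after which any row can be reached with a single `seek`.

```python
import os
import tempfile


def build_line_index(path):
    """Return the byte offset at which each line of `path` starts.

    This is the one-off full pass over the file mentioned above.
    """
    offsets = []
    with open(path, "rb") as f:
        while True:
            pos = f.tell()
            line = f.readline()
            if not line:
                break
            offsets.append(pos)
    return offsets


def read_line(path, offsets, i):
    """Jump straight to line `i` without reading the lines before it."""
    with open(path, "rb") as f:
        f.seek(offsets[i])
        return f.readline().decode("ascii").rstrip("\r\n")


# Tiny made-up file whose rows have different lengths (hypothetical data).
tmp = tempfile.NamedTemporaryFile("wb", suffix=".gen", delete=False)
tmp.write(b"rs1:10:A:G 10 A G 1 0 0\n")
tmp.write(b"rs22:200:C:T 200 C T 0 1 0 0.5 0.5 0\n")
tmp.write(b"rs333:3000:G:A 3000 G A 0 0 1\n")
tmp.close()

offsets = build_line_index(tmp.name)
third_row = read_line(tmp.name, offsets, 2)
os.unlink(tmp.name)
```

In practice the offsets would be saved next to the data file so that the full pass happens only once.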

Is there no tool that converts such a file to bgen, for example?

hoangthienan95 commented 5 years ago

@horta sorry for the confusion, I edited my sample format for clarity. The separator is indeed space and the triplicates are in three columns per subject.
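As a rough sketch of that layout (space separator, four leading fields, then three probabilities per subject), a single row could be parsed as below. The function name `parse_row` is hypothetical, and the second subject in the sample row is made up to show the grouping; only the first seven fields come from the sample above.

```python
def parse_row(line):
    """Split one space-separated row into its variant fields and
    a list of (p0, p1, p2) probability triplets, one per subject."""
    fields = line.split()
    rsid, position, ref, alt = fields[0], int(fields[1]), fields[2], fields[3]
    values = [float(x) for x in fields[4:]]
    if len(values) % 3 != 0:
        raise ValueError("expected three probabilities per subject")
    # Group the flat list of probabilities into one triplet per subject.
    probs = [tuple(values[i:i + 3]) for i in range(0, len(values), 3)]
    return rsid, position, ref, alt, probs


# First subject taken from the sample row; second subject is invented.
row = "rs145615430:56:C:T 56 C T 1 0 0 0 0.5 0.5"
rsid, position, ref, alt, probs = parse_row(row)
```

A function of this shape could presumably be wrapped with `dask.delayed` so that each row is parsed only when requested, but that is an assumption and has not been tried on the real file.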

Indexing it is a great idea. I didn't find a CSV indexer after a quick search. However, in order to index it, I would have to convert it to a binary format like BGEN anyway, right?

QCtool seems promising for converting GEN to BGEN, so I'll try it today and then use the bgen-reader on the result. However, I'm afraid my file is not in a typical GEN format (the first column looks different from all the GEN files I have seen), so fingers crossed.

I'll also ask the consortia who have been producing this dataset what tool they use to wrangle it.

When you have a weirdly-formatted, big file like this and your GEN-reader fails, what tool do you fall back on? PLINK?

horta commented 5 years ago

Hi @hoangthienan95, I had forgotten about this thread. I don't have much experience with that, to be honest. GEN is a text format, so converting it to the PLINK binary format is something I would try. Sorry for not being more useful here.

limix / bgen-reader-py

GEN-reader? #19