saphir746 / BiobankRead-Bash

Python scripts to extract and pre-process UKB data
GNU General Public License v3.0

Scaling, hierarchical tree parsing, general questions #12

Open hoangthienan95 opened 5 years ago

hoangthienan95 commented 5 years ago

Hi there! Thanks so much for the awesome package. I was in the process of writing my own phenotype parser when I found this one; it saved me a lot of time and also provided guidance for the use cases that are specific to me.

I have some questions about the package:

  1. Can you comment on how well the package scales? I see that it mostly uses numpy and pandas, which I assume load all the data into memory. Will this be a problem when the queried dataframe is very large (many phenotypes at a time), or when UKB adds more phenotypes and more participants? Are there any cases where you have seen the package take a performance hit or run out of memory?
  2. Is there currently functionality that, for a hierarchical categorical attribute, grabs all the levels below a specific value? For example, if I put White for ethnic background, it would return everyone who is "White", "British", "Irish", or "Other white background".
  3. Do you have a way of saving newly created, complex phenotype definitions and/or filters for quick reference and reproducibility later?
  4. I see that you parse the html file for the field-related information. Aside from the html being specific to a UKB data access application, is there a reason why the data dictionary csv was not used? I'm currently using it and wonder whether you avoided it for a specific reason.
  5. Are you currently working on adding to the documentation and use cases? I'd be more than happy to document and write up my use of the package as one of the examples for other people to use.
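
To make question 2 concrete, the behaviour could be sketched roughly as below. This is only an illustration of the idea, not something the package is known to provide: the `PARENT_OF` mapping and the `expand_category` helper are hypothetical names, and the tiny hierarchy just reuses the labels from the example above rather than a real UKB coding table.

```python
from collections import defaultdict

# Hypothetical child -> parent relationships, reusing the labels from the
# example in question 2. A real hierarchy would have to come from the
# UKB coding tables, not from this hand-written dict.
PARENT_OF = {
    "British": "White",
    "Irish": "White",
    "Other white background": "White",
}


def build_children(parent_of):
    """Invert a child -> parent mapping into parent -> [children]."""
    children = defaultdict(list)
    for child, parent in parent_of.items():
        children[parent].append(child)
    return children


def expand_category(label, parent_of):
    """Return `label` plus every level below it in the hierarchy."""
    children = build_children(parent_of)
    selected, stack = [], [label]
    while stack:
        node = stack.pop()
        selected.append(node)
        # Push the direct children so deeper levels are visited too.
        stack.extend(children.get(node, []))
    return selected


levels = expand_category("White", PARENT_OF)
```

With a pandas dataframe, the resulting list could then be used as a filter, for example `df[df["ethnic_background"].isin(levels)]`, where the column name is again only a placeholder.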

Thanks!
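
P.S. On question 1, the usual way to keep memory bounded is to avoid loading the full export and stream only the columns that are needed. Below is a minimal sketch using the standard-library `csv` module rather than anything from the package itself; `stream_fields` is a hypothetical helper, and the `eid` and `21000-0.0` column names and the sample values are assumptions made up for the illustration.

```python
import csv
import io


def stream_fields(handle, wanted, id_column="eid"):
    """Yield one small dict per row, keeping only the requested columns."""
    reader = csv.DictReader(handle)
    keep = [id_column] + [c for c in wanted if c != id_column]
    for row in reader:
        # Only the selected columns are retained, one row at a time,
        # so the full table never has to sit in memory.
        yield {col: row[col] for col in keep}


# Tiny in-memory stand-in for a much larger export (made-up values).
sample = io.StringIO(
    "eid,21000-0.0,31-0.0\n"
    "1000001,1001,0\n"
    "1000002,1002,1\n"
)
rows = list(stream_fields(sample, wanted=["21000-0.0"]))
```

If staying within pandas is preferable, `pandas.read_csv` accepts `usecols` and `chunksize` arguments that serve the same purpose.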

msolati commented 4 years ago

Hi there,

These are great questions that I have too. Have you been able to find the answers?

Thanks