About Datasets - Githubissues

YOUKAINOYAMA commented 10 months ago

Hello, thank you for sharing the code. I would like to replicate your work, and I have obtained access permission for ADNI. However, I'm facing difficulties in selecting the data according to the descriptions in the paper. If possible, could you share the dataset you filtered from ADNI with me, or provide some guidance on how to select file names or table names on the official ADNI website? My email is sanshaveheart@outlook.com. Thanks again for your work.

sydat2701 commented 9 months ago

Hi, I'm facing the same problem as you. Did you solve it? If so, could you share the solution with me? I would also really like to experiment with this work. Thank you so much for your help.

hnchan13 commented 9 months ago

I am facing the same issue as well and also have access to ADNI. Would anyone be able to help please or point out any mistakes I have made (if any)? The following are my issues:

Genetic Data

I had to add the line if vcf_file.endswith(".gz"): inside the for loop for vcf_file in files: of the Python script filter_vcfs.py to prevent .vcf.gz.tbi files from being processed, as they caused errors.
filter_vcfs.py appears to generate only .pkl files and log.txt. However, after iterating through all the files (the ADNI WGS (GATK) data), not a single .pkl file was generated. The only output was log.txt, which contains boolean values (nearly all, if not all, are False). Issue: No pickle files generated, so I am unable to feed this data into the downstream script concat_vcfs.py.
I am struggling to find the labels for the genetic data used in the MADDi study. Line 12 of the Python script concat_vcfs.py, diag = pd.read_csv("YOUR_PATH_TO_DIAGNOSIS_TABLE"), requires a diagnosis table that I am unable to locate. Issue: Unable to find the diagnosis table on the ADNI website.
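
The .gz check described in the first item above can be sketched as follows. This is a minimal illustration, not the repository's code: list_vcf_archives is a hypothetical helper, and matching on the full ".vcf.gz" suffix (rather than ".gz" alone) is an assumption made here so that .vcf.gz.tbi index files and log.txt are both skipped.

```python
from pathlib import Path


def list_vcf_archives(directory):
    """Return sorted paths of compressed VCF files in a directory.

    Hypothetical helper: tabix indexes (.vcf.gz.tbi) and any other
    files are skipped because their names do not end in ".vcf.gz".
    """
    return sorted(
        path
        for path in Path(directory).iterdir()
        if path.is_file() and path.name.endswith(".vcf.gz")
    )
```

Iterating over the returned list instead of every file in the directory would keep index files away from the parser without an extra check inside the loop.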

Additional issues faced during genetic data pre-processing

For: ./ADNI.808_indiv.minGQ_21.pass.ADNI_ID.chr3.vcf.gz

CSV reading complete vcf: <pandas.io.parsers.readers.TextFileReader object at 0x7fe95fb15790>
Traceback (most recent call last):
  File "/home/user/Alzheimers/genetic_data/filter_vcfs.py", line 100, in <module>
    main()
  File "/home/user/Alzheimers/genetic_data/filter_vcfs.py", line 61, in main
    vcf = pd.concat(vcf, ignore_index=True)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/user/anaconda3/lib/python3.11/site-packages/pandas/util/_decorators.py", line 331, in wrapper
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/home/user/anaconda3/lib/python3.11/site-packages/pandas/core/reshape/concat.py", line 368, in concat
    op = _Concatenator(
         ^^^^^^^^^^^^^^
  File "/home/user/anaconda3/lib/python3.11/site-packages/pandas/core/reshape/concat.py", line 422, in __init__
    objs = list(objs)
           ^^^^^^^^^^
  File "/home/user/anaconda3/lib/python3.11/site-packages/pandas/io/parsers/readers.py", line 1698, in __next__
    return self.get_chunk()
           ^^^^^^^^^^^^^^^^
  File "/home/user/anaconda3/lib/python3.11/site-packages/pandas/io/parsers/readers.py", line 1810, in get_chunk
    return self.read(nrows=size)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/home/user/anaconda3/lib/python3.11/site-packages/pandas/io/parsers/readers.py", line 1778, in read
    ) = self._engine.read(  # type: ignore[attr-defined]
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/user/anaconda3/lib/python3.11/site-packages/pandas/io/parsers/c_parser_wrapper.py", line 230, in read
    chunks = self._reader.read_low_memory(nrows)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "pandas/_libs/parsers.pyx", line 820, in pandas._libs.parsers.TextReader.read_low_memory
  File "pandas/_libs/parsers.pyx", line 866, in pandas._libs.parsers.TextReader._read_rows
  File "pandas/_libs/parsers.pyx", line 852, in pandas._libs.parsers.TextReader._tokenize_rows
  File "pandas/_libs/parsers.pyx", line 1973, in pandas._libs.parsers.raise_parser_error
pandas.errors.ParserError: Error tokenizing data. C error: Expected 817 fields in line 1476784, saw 833
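
One way to narrow down a "Expected 817 fields ... saw 833" error is to count fields per line directly, outside pandas. The sketch below assumes the file is gzip-compressed and tab-delimited (as the VCF format specifies) and that lines starting with "##" are meta-information. field_count_report is a hypothetical helper, not part of filter_vcfs.py. The line number pandas reports may not match the raw file line number if the script skips rows when reading, so the first_seen values are only a starting point. If the 808_indiv in the file name reflects the sample count, 817 fields would be consistent with 9 fixed VCF columns plus 808 samples, which would make the 833-field line the malformed one.

```python
import gzip
from collections import Counter


def field_count_report(path, sep="\t", comment_prefix="##"):
    """Count fields per line in a gzipped delimited text file.

    Returns (counts, first_seen):
      counts     maps field count -> number of lines with that count
      first_seen maps field count -> 1-based raw line number of the
                 first line with that count
    Lines starting with comment_prefix are skipped but still numbered.
    """
    counts = Counter()
    first_seen = {}
    with gzip.open(path, "rt") as handle:
        for line_number, line in enumerate(handle, start=1):
            if line.startswith(comment_prefix):
                continue  # assumed VCF meta-information line
            n_fields = len(line.rstrip("\n").split(sep))
            counts[n_fields] += 1
            first_seen.setdefault(n_fields, line_number)
    return counts, first_seen
```

A well-formed file should yield a single key in counts. Any additional key points to lines whose field count differs from the rest, and first_seen gives a place to start inspecting them.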

hnchan13 commented 9 months ago

For the python script filter_vcfs.py on line 53, end = vcf_file.find("output.vcf"), it seems this value will always produce -1 given that none of the vcf_files contain "output.vcf", was this intended?
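
The behaviour in question can be checked in isolation: str.find returns -1 when the substring is absent, and -1 is also a valid slice bound in Python, so no error is raised. The sketch below is illustrative only. How filter_vcfs.py actually uses end is not shown in this thread, so the slicing is an assumption, and slice_before_marker is a hypothetical helper.

```python
def slice_before_marker(name, marker):
    """Return the part of name before marker, or None if marker is absent.

    Hypothetical helper: makes the "not found" case explicit instead of
    letting -1 flow into a slice.
    """
    end = name.find(marker)
    if end == -1:
        return None
    return name[:end]


file_name = "ADNI.808_indiv.minGQ_21.pass.ADNI_ID.chr3.vcf.gz"

# The marker is absent from the ADNI file name, so find() reports -1.
assert file_name.find("output.vcf") == -1

# Used as a slice end, -1 silently drops the last character.
assert file_name[:-1] == "ADNI.808_indiv.minGQ_21.pass.ADNI_ID.chr3.vcf.g"
```

If end is used this way in the script, every derived name would lose its final character rather than being cut at the marker.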

YOUKAINOYAMA commented 9 months ago

Sorry, I'm not quite sure. I plan to conduct an experiment on this paper using the dataset I collected myself, and the author cannot disclose this medical dataset to us.

rsinghlab / MADDi

About Datasets #16