mdahasan / mClass---Multiple-cancer-classification

dataset Required #1

Open yasirniazi opened 3 years ago

yasirniazi commented 3 years ago

Hi, I hope you are well. I have just started working with this project and would like to run the code to understand the complete workflow. For that I need the all_data.txt file, but I could not follow the dataset link that is provided. Could you please share the complete dataset needed to run the source code? Thanks.

mdahasan commented 3 years ago

Hello,

You can download the data from my shared drive. https://drive.google.com/file/d/1DWmKMHkZtmu7S-DuPF3IRTdnWR9gfyWj/view?usp=sharing

Hope this helps.

Thank you.

-- Md. Abid Hasan Ph.D. Algorithms and Computational Biology Lab Department of Computer Science and Engineering Bourns College of Engineering University of California Riverside, CA 92521

Principal Scientist I Bioinformatics Roche Sequencing Solutions, Inc. Pleasanton, CA 94588

scchess commented 2 years ago

Thanks @mdahasan. Could I please also get a copy of the gene_snp_frequency.txt file, which is missing from the repo? Thanks!

mdahasan commented 2 years ago

Hi @scchess, I apologize; it's been many years and this project isn't actively maintained. I looked for the file you requested but couldn't find it. (Also, this is poor Python code from my early work, not my best.)

Looking at the code, though, I think gene_snp_frequency.txt is a product of 1_data_preprocess.py. See this line: https://github.com/mdahasan/mClass---Multiple-cancer-classification/blob/387ead12ac307b85b1a8e585ba027d681622fb6d/1_data_preprocess.py#L71 That should be the "per gene SNP count". I'm not sure why it isn't stored in a file called gene_snp_frequency.txt, but you could write all_sample_cancer_snp_data to a file named gene_snp_frequency.txt, and that should work.

Again, I apologize for the inconvenience. As I said, it's an old work from an ignorant python coder.

scchess commented 2 years ago

What about this?

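import sys
import pandas as pd

# Read the tab-separated sample-by-gene SNP table given on the command line.
df = pd.read_csv(sys.argv[1], sep="\t")
# Drop the label column first so only per-gene SNP counts are summed.
sums = df.drop(columns=["Cancer_type"], errors="ignore").sum(axis=0)
with open("gene_snp_frequency.txt", "w") as w:
    # One line per gene: gene name, tab, total SNP count across all samples.
    for gene, count in sums.items():
        w.write(f"{gene}\t{count}\n")
print("Generated: gene_snp_frequency.txt")

mdahasan commented 2 years ago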

I can't say for sure whether it'll work or not, but it seems like it should. The file gene_snp_frequency.txt is simply the gene name and the corresponding SNP count for that gene across all samples. It should be pretty straightforward.
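For illustration only, here is a minimal sketch of that "gene name plus total SNP count" layout on a toy table. The gene names (GENE_A, GENE_B), the sample values, and the helper name gene_snp_frequency are hypothetical, and the input layout (one row per sample, a Cancer_type label column, one column per gene) is assumed from the script posted above rather than confirmed against the repo.

```python
import io
import pandas as pd

# Hypothetical toy table: one row per sample, one column per gene,
# plus a "Cancer_type" label column (layout assumed, not confirmed).
TOY = "Cancer_type\tGENE_A\tGENE_B\nBRCA\t1\t0\nLUAD\t2\t3\nBRCA\t0\t4\n"


def gene_snp_frequency(df, label_col="Cancer_type"):
    """Return the total SNP count per gene across all samples."""
    # Drop the label column so that only numeric gene columns are summed.
    return df.drop(columns=[label_col], errors="ignore").sum(axis=0)


df = pd.read_csv(io.StringIO(TOY), sep="\t")
freq = gene_snp_frequency(df)

# One "gene<TAB>count" line per gene, as described for gene_snp_frequency.txt.
lines = [f"{gene}\t{count}" for gene, count in freq.items()]
print("\n".join(lines))
```

If the real all_data.txt has other non-numeric columns (a sample ID, for example), those would need to be dropped as well before summing.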

scchess commented 2 years ago

Thanks. It looks like the generated gene_snp_frequency.txt file works. However, running 6_feature_selection_with_mi.py fails with an error about a missing All_Class_feature_MI_down.txt file. I'm not sure how to generate this file.