Open gahlawat36 opened 1 year ago
Can we perform correlation or frequency analysis as well?
I think correlation analysis will help us understand the relationship between the "Start" and "End" columns. But we also need to understand the data better before doing that.
Absolutely, correlation and frequency analysis sound like a great way to explore the dataset further. Correlation analysis can indeed give us insight into the relationship between the 'Start' and 'End' columns. Before we get into that, let's make sure we have a solid understanding of the data. To proceed, we could work through three steps: Data Understanding, Visual Exploration, and Correlation Analysis.
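To make the correlation and frequency ideas concrete, here is a minimal sketch with pandas. The rows are made up for illustration and are not from our file; the column names follow the ones used in this thread. It assumes raw GTF coordinates (1-based, inclusive), so a feature's length is End - Start + 1; if the loader shifts coordinates, that formula would need adjusting. Since Start and End are both positions on the same sequence, they will usually be very strongly correlated, so a derived Length column may turn out to be the more informative thing to analyse.

```python
import pandas as pd

# Made-up rows for illustration only; not taken from the real dataset.
df = pd.DataFrame({
    "Feature": ["exon", "exon", "CDS", "exon", "CDS"],
    "Strand": ["+", "+", "+", "-", "-"],
    "Start": [100, 400, 120, 900, 950],
    "End": [250, 520, 250, 1400, 1300],
})

# Assumes 1-based inclusive coordinates, as in a raw GTF file.
df["Length"] = df["End"] - df["Start"] + 1

# Pearson correlation between the two coordinate columns.
start_end_corr = df["Start"].corr(df["End"])

# Frequency analysis: how often each category occurs.
feature_counts = df["Feature"].value_counts()
strand_counts = df["Strand"].value_counts()

print("Start/End correlation:", round(start_end_corr, 3))
print(feature_counts)
print(strand_counts)
```

On the real data the same three calls should work once the frame has 'Start', 'End', 'Feature' and 'Strand' columns.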
Hi @gahlawat36 @Sreekodavanti @harshal3107 @Nishana08, I think we have the column names wrong in our 'EDA_Genome.ipynb' file.
I have added the correct column names below. Can we change the notebook accordingly?
Chromosome | Source | Feature | Start | End | Score | Strand | Frame | gene_id | transcript_id | gene_name | exon_number | exon_id
If we assume these column names, the data becomes much clearer to work with.
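For reference, a rough sketch of how those column names map onto a GTF file: the first eight are the fixed tab-separated fields, and the remaining ones (gene_id, transcript_id and so on) come from the key "value" pairs in the ninth field. The sample lines and the helper names parse_attributes and parse_gtf are made up for illustration; this is not how pyranges does it internally, only a standard-library approximation.

```python
import csv
import io

# Names for the eight fixed GTF fields, as proposed in this thread.
GTF_FIELDS = ["Chromosome", "Source", "Feature", "Start",
              "End", "Score", "Strand", "Frame"]


def parse_attributes(text):
    """Split the ninth GTF field ('key "value"; ...') into a dict."""
    attrs = {}
    for part in text.strip().split(";"):
        part = part.strip()
        if not part:
            continue
        key, _, value = part.partition(" ")
        attrs[key] = value.strip().strip('"')
    return attrs


def parse_gtf(handle):
    """Return one dict per feature line, skipping comments and short lines."""
    rows = []
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    for fields in reader:
        if not fields or fields[0].startswith("#") or len(fields) < 9:
            continue
        row = dict(zip(GTF_FIELDS, fields[:8]))
        row["Start"] = int(row["Start"])
        row["End"] = int(row["End"])
        row.update(parse_attributes(fields[8]))
        rows.append(row)
    return rows


# Hypothetical sample lines, not copied from the real file.
SAMPLE = (
    "# made-up header comment\n"
    'scaffold_1\tensGene\texon\t100\t250\t.\t+\t.\t'
    'gene_id "G1"; transcript_id "T1"; exon_number "1"; exon_id "T1.1";\n'
    'scaffold_1\tensGene\tCDS\t120\t250\t.\t+\t0\t'
    'gene_id "G1"; transcript_id "T1"; exon_number "1"; exon_id "T1.1";\n'
)

records = parse_gtf(io.StringIO(SAMPLE))
print(records[0])
```

A list of dicts like this can be passed straight to pandas.DataFrame(records) if we want to continue in the notebook.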
I got some help from section 2. Their data is also in GTF format, so they load it with pyranges.read_gtf():
import pyranges as pr
data = pr.read_gtf(r'path\dipOrd1.ensGene.gtf')  # raw string so the backslash is not read as an escape
Will this work for us too?
Okay then I'll try to incorporate all these ideas into code.
I have converted the GTF file into a pandas DataFrame. I will try this new approach as well.
I have completed basic summary statistics of the data and produced a few plots. But since genome data is not our field of study, we need more help from the research team and from online resources to get ideas for further analysis.
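One possible next step beyond overall summary statistics is to summarise feature lengths per chromosome and feature type. A small sketch, again on made-up rows and assuming 1-based inclusive coordinates as in a raw GTF file:

```python
import pandas as pd

# Made-up rows for illustration only; not taken from the real dataset.
df = pd.DataFrame({
    "Chromosome": ["scaffold_1", "scaffold_1", "scaffold_2",
                   "scaffold_2", "scaffold_2"],
    "Feature": ["exon", "CDS", "exon", "exon", "CDS"],
    "Start": [100, 120, 400, 900, 950],
    "End": [250, 250, 520, 1400, 1300],
})

# Assumes 1-based inclusive coordinates.
df["Length"] = df["End"] - df["Start"] + 1

# Count and length statistics for each (Chromosome, Feature) pair.
summary = (
    df.groupby(["Chromosome", "Feature"])["Length"]
    .agg(["count", "mean", "min", "max"])
    .reset_index()
)
print(summary)
```

Whether these groupings are biologically meaningful is something the research team would need to confirm.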