professor-greebie / SENG8080-1-field_project

Basic Summary Statistics are done, discuss further approach. #121

Open gahlawat36 opened 1 year ago

gahlawat36 commented 1 year ago

I have completed basic summary statistics for the data and produced a few plots. Since genome data is outside our field of study, we need more help from the research team and online resources to get ideas for further analysis.

HareeshCon commented 1 year ago

Can we perform correlation or frequency analysis as well?

I think correlation analysis would help us understand the relationship between the "Start" and "End" columns, but we also need to understand the data better before doing that.

Nishana08 commented 1 year ago

Absolutely, correlation and frequency analysis sound like a good way to explore the dataset further. Correlation analysis can show how the 'Start' and 'End' columns relate to each other. Before we get into that, let's make sure we have a solid understanding of the data. I suggest three steps, in this order: data understanding, visual exploration, and correlation analysis.
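To make the frequency and correlation ideas concrete, here is a minimal sketch using only the standard library. The records are made up for illustration, and the 'Feature' key is an assumption about how the feature-type column will be named. Genomic coordinates are positions along a chromosome, so 'Start' and 'End' will probably correlate almost perfectly. The feature length (End - Start + 1, since GTF coordinates are 1-based and inclusive) may be the more informative quantity to look at.

```python
from collections import Counter
from math import sqrt

# Made-up records for illustration; real values would come from the GTF file.
records = [
    {"Feature": "exon", "Start": 100, "End": 250},
    {"Feature": "CDS", "Start": 120, "End": 240},
    {"Feature": "exon", "Start": 1000, "End": 1300},
    {"Feature": "exon", "Start": 5000, "End": 5100},
]


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


starts = [r["Start"] for r in records]
ends = [r["End"] for r in records]

# Frequency analysis: how often each feature type occurs.
feature_counts = Counter(r["Feature"] for r in records)

# Correlation between the two coordinate columns.
start_end_corr = pearson(starts, ends)

# Feature lengths; GTF coordinates are 1-based and inclusive.
lengths = [end - start + 1 for start, end in zip(starts, ends)]

print(feature_counts)
print(round(start_end_corr, 4))
print(lengths)
```

If the data is already in a pandas DataFrame, `df["Feature"].value_counts()` and `df["Start"].corr(df["End"])` should give the same kind of results.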

HareeshCon commented 1 year ago

Hi @gahlawat36 @Sreekodavanti @harshal3107 @Nishana08, I think the column names in our 'EDA_Genome.ipynb' file are wrong.

I have added the correct column names below. Can we change them accordingly?

Chromosome | Source | Feature | Start | End | Score | Strand | Frame | gene_id | transcript_id | gene_name | exon_number | exon_id

If we assume these column names, the data becomes reasonably clear and usable.
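As a rough sketch of how these names map onto a GTF file: the first eight names are the fixed tab-separated fields, and the remaining ones (gene_id, transcript_id, and so on) come from the key "value" pairs in the ninth field. The parser below is a hypothetical helper, not what our notebook currently does, and the sample record is made up. It assumes attribute values are quoted, which is common but not guaranteed for every file.

```python
import csv
import io
import re

# First eight fixed GTF fields, named as in the list above.
FIXED_COLUMNS = ["Chromosome", "Source", "Feature", "Start", "End",
                 "Score", "Strand", "Frame"]

# Matches attribute pairs such as: gene_id "G1";
# Assumption: every attribute value is wrapped in double quotes.
ATTRIBUTE_PATTERN = re.compile(r'(\S+)\s+"([^"]*)"')


def parse_gtf(handle):
    """Yield one dict per GTF record, with attributes expanded into keys."""
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    for fields in reader:
        # Skip blank lines, comment lines, and rows with too few fields.
        if not fields or fields[0].startswith("#") or len(fields) < 9:
            continue
        record = dict(zip(FIXED_COLUMNS, fields[:8]))
        record["Start"] = int(record["Start"])
        record["End"] = int(record["End"])
        # The ninth field holds the key "value"; pairs.
        record.update(ATTRIBUTE_PATTERN.findall(fields[8]))
        yield record


# Made-up sample record, for illustration only.
SAMPLE = (
    'chr1\tensGene\texon\t100\t250\t.\t+\t.\t'
    'gene_id "G1"; transcript_id "T1"; exon_number "1";\n'
)

records = list(parse_gtf(io.StringIO(SAMPLE)))
print(records[0]["Feature"], records[0]["Start"], records[0]["gene_id"])
```

To read a real file, open it in text mode and pass the file object to `parse_gtf` in place of the `io.StringIO` sample.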

HareeshCon commented 1 year ago

I got help from section 2. Their data is in GTF format, so they load it with pyranges.read_gtf():

import pyranges as pr
# Raw string so the backslash in the Windows-style path is not read as an escape.
data = pr.read_gtf(r'path\dipOrd1.ensGene.gtf')

Will this work for us too?

HareeshCon commented 1 year ago

[Image: Dataset sample]

gahlawat36 commented 1 year ago

Okay then I'll try to incorporate all these ideas into code.

gahlawat36 commented 1 year ago

I have converted the GTF file into a pandas DataFrame. I will try this new approach as well.

HareeshCon commented 1 year ago

https://github.com/zaneveld/full_spectrum_bioinformatics/blob/ef7cf521324597047d545263a49cea624e07dd32/content/06_biological_sequences/reading_and_writing_fasta_files.ipynb

This might help as well.