manqizhou / moETM


How to determine gene coding and protein coding #2

Closed KID-KK closed 4 months ago

KID-KK commented 5 months ago

Hello, I have a question. In the "useful file" folder, how are "is gene coding" and "is protein coding" determined in the files gene_coding_nips_rna_atac.csv, gene_coding_nips_rna_protein.csv, and protein_coding_nips_rna_protein.csv? What is their specific purpose?

manqizhou commented 4 months ago

Hi. Thank you for asking.

For the 'gene_coding_xxx' files, the 'is_gene_coding' column is determined with the R function getBM(values=gene_ids, filters = "external_gene_name", mart=mart, attributes = c('external_gene_name','gene_biotype')) from the biomaRt package. If the returned gene_biotype attribute is 'protein_coding', the gene is a protein-coding gene, and its 'is_gene_coding' value is set to 1.
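To make the labelling step concrete, here is a minimal Python sketch of what could happen after the getBM query has returned its table. It is not the code used to build the files: the function names, the 'gene_name' output column, the example rows, and the choice to label genes missing from the query result as 0 are all assumptions made for illustration.

```python
import csv
import io


def label_gene_coding(gene_names, biotype_rows):
    """Map each gene name to 1 if its biotype is 'protein_coding', else 0.

    biotype_rows mimics the two attributes requested from getBM:
    'external_gene_name' and 'gene_biotype'.
    Assumption: genes absent from the query result are labelled 0.
    """
    coding = {
        row["external_gene_name"]
        for row in biotype_rows
        if row["gene_biotype"] == "protein_coding"
    }
    return {name: int(name in coding) for name in gene_names}


def to_csv(labels):
    """Serialise the labels as CSV text (the column names are hypothetical)."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow(["gene_name", "is_gene_coding"])
    for name, flag in labels.items():
        writer.writerow([name, flag])
    return buf.getvalue()


# Illustrative rows only, not real query output.
rows = [
    {"external_gene_name": "CD4", "gene_biotype": "protein_coding"},
    {"external_gene_name": "MALAT1", "gene_biotype": "lncRNA"},
]
labels = label_gene_coding(["CD4", "MALAT1", "GENE_NOT_FOUND"], rows)
print(to_csv(labels))
```

Building a set of protein-coding names first keeps the lookup independent of the order in which the query returns its rows.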

For the 'protein_coding_xxx' files, the 'is_protein_coding' column is 1 for all proteins unless the protein is a protein complex and does not have an Ensembl ID (such as TCR).
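The rule for the protein files can be sketched in the same way. This is only an illustration of the stated rule: the function name, the input layout, and the placeholder ID string are hypothetical, and how complexes are actually identified is not described in this thread.

```python
def is_protein_coding(ensembl_id, is_complex):
    """Return 0 only for a protein complex without an Ensembl ID, else 1."""
    return 0 if (is_complex and not ensembl_id) else 1


# Hypothetical inputs: (protein name, Ensembl ID or None, is it a complex?).
# "PLACEHOLDER_ID" stands in for a real Ensembl ID.
proteins = [
    ("CD4", "PLACEHOLDER_ID", False),
    ("TCR", None, True),
]
flags = {name: is_protein_coding(eid, cplx) for name, eid, cplx in proteins}
print(flags)
```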

We use protein-coding genes because they are well studied and have many known regulatory relationships. Many housekeeping genes also play an important role and show importance in our model, but they are harder to interpret because they are less well studied than protein-coding genes.

I hope this answers your question.