Details about data collection

sabifo4 / mammals_dating

Methods for the analyses described in Álvarez-Carretero et al. 2022 (https://doi.org/10.1038/s41586-021-04341-1)

GNU General Public License v3.0

Details about data collection #56

Closed: MinjieHu closed this issue 2 years ago

MinjieHu commented 2 years ago

Hi Sandra,

Congrats on your new publication in Nature! It's super interesting!

I am interested in applying your method to cnidarians. After reading your description of the data collection, I am still confused about how you collected the data. Based on your gene filtering script, it looks like you are starting with multiple gene alignments. Could you give me more detailed instructions on how you got the alignment files from Ensembl Biomart?

Thanks in advance for the help!

Minjie Hu

sabifo4 commented 2 years ago

Hi Minjie Hu,

Thank you very much for your feedback!

Asif Tamuri, @tamuri, was in charge of the data collection -- I started this GitHub repository with the dataset that had already been assembled before I joined the project. He might be able to assist you better with this specific matter!

All the best, Sandra

tamuri commented 2 years ago

Is there something specific you're having trouble with? Biomart allows you to build up a filter for the dataset you'd like to download. For example, you can:

- Select 'Ensembl Genes' for Database
- Select 'Human genes' for Dataset
- Add a filter for Gene Type = 'protein_coding'

That will select all the protein-coding genes in the human dataset. Then do the same for the other species of interest. For each of those species, you would download only the genes for which there is a human orthologue, using the filter under 'Multi Species Comparison'.

This requires getting familiar with the Biomart interface and the various filters in Ensembl Biomart.
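
For anyone who prefers to script the selection, here is a minimal sketch of the same filter written as a BioMart XML query. It only builds the query text and does not contact Ensembl. The dataset name `hsapiens_gene_ensembl`, the filter name `biotype`, and the attribute names are assumptions based on the usual Ensembl naming and should be checked against the current Biomart interface; `build_biomart_query` is a hypothetical helper, not part of this repository, and this is not necessarily how the dataset for the paper was assembled.

```python
import xml.etree.ElementTree as ET


def build_biomart_query(dataset, filters, attributes):
    """Return a BioMart XML query string for a single dataset.

    dataset    -- BioMart dataset name (assumed, e.g. 'hsapiens_gene_ensembl')
    filters    -- dict mapping filter name to filter value
    attributes -- list of attribute (output column) names
    """
    query = ET.Element("Query", {
        "virtualSchemaName": "default",
        "formatter": "TSV",
        "header": "1",
        "uniqueRows": "1",
        "datasetConfigVersion": "0.6",
    })
    ds = ET.SubElement(query, "Dataset", {"name": dataset, "interface": "default"})
    for name, value in filters.items():
        ET.SubElement(ds, "Filter", {"name": name, "value": value})
    for name in attributes:
        ET.SubElement(ds, "Attribute", {"name": name})
    return ET.tostring(query, encoding="unicode")


# Human protein-coding genes: the three interface steps above as one query.
# All names below are assumptions to verify in the Biomart interface.
human_query = build_biomart_query(
    dataset="hsapiens_gene_ensembl",
    filters={"biotype": "protein_coding"},
    attributes=["ensembl_gene_id", "external_gene_name"],
)
print(human_query)
```

The 'XML' button in the Biomart web interface shows the query that corresponds to whatever you have selected by hand, which is the safest way to find the exact filter and attribute names, including the ones for the 'Multi Species Comparison' filters.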

MinjieHu commented 2 years ago

Hi Sandra and tamuri, Thanks for your super fast response.

@tamuri, I am still a little confused about how you collected your data. Based on what you described, you just got the protein-coding sequences from Biomart, am I right? Can you also clarify how you generated the one-to-one protein-coding orthologues? And based on the gene filtering script, it looks like the FASTA alignment file is the input. How did you generate the alignment file? Is there a particular alignment method you are using?

Thanks again for the help.

Minjie

tamuri commented 2 years ago

> Based on what you described, you just got the protein-coding sequences from Biomart, am I right?

Yes. You can get the orthologues of each species from Biomart using the filters described above. The peptide and/or CDS can be selected in the 'Attributes' options. Have a look at this short tutorial. There's also a video.

> Can you also clarify how you generated the one-to-one protein-coding orthologues?

Look at the 'homology type', which is one of the attributes. That describes the relationship.
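
As a rough illustration, once a table with the homology type has been downloaded, keeping only the one-to-one pairs is a simple row filter. This sketch assumes a tab-separated export with a column called `homology_type` and the value `ortholog_one2one`; the column names, gene identifiers, and the helper `one_to_one_orthologues` are made up for the example, so adjust them to whatever your Biomart export actually contains.

```python
import csv
import io


def one_to_one_orthologues(tsv_text, type_column="homology_type",
                           wanted="ortholog_one2one"):
    """Return the rows of a tab-separated table whose homology type is one-to-one."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    return [row for row in reader if row[type_column] == wanted]


# Toy table with hypothetical gene identifiers.
example = (
    "human_gene\tother_gene\thomology_type\n"
    "GENE_A\tGENE_A1\tortholog_one2one\n"
    "GENE_B\tGENE_B1\tortholog_one2many\n"
    "GENE_B\tGENE_B2\tortholog_one2many\n"
    "GENE_C\tGENE_C1\tortholog_one2one\n"
)

kept = one_to_one_orthologues(example)
print([row["human_gene"] for row in kept])  # ['GENE_A', 'GENE_C']
```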

> And based on the gene filtering script, it looks like the FASTA alignment file is the input. How did you generate the alignment file? Is there a particular alignment method you are using?

We used a program called PRANK.
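
A hedged sketch of how a PRANK run could be scripted for one gene is shown below. The exact options used for the paper are not stated in this thread, so everything here is an assumption: `-d=` (input), `-o=` (output prefix), and `-codon` (codon-aware alignment of coding sequences) are standard PRANK options as far as I know, and the file names and the helper `build_prank_command` are hypothetical. The code only builds and prints the command; it does not run PRANK.

```python
import shlex


def build_prank_command(fasta_in, out_prefix, codon=True):
    """Build the argument list for a PRANK run on one unaligned FASTA file."""
    cmd = ["prank", f"-d={fasta_in}", f"-o={out_prefix}"]
    if codon:
        # Assumed option: align coding sequences with PRANK's codon model.
        cmd.append("-codon")
    return cmd


# Hypothetical file names for a single gene.
cmd = build_prank_command("gene1.fasta", "gene1_aligned")
print(shlex.join(cmd))  # prank -d=gene1.fasta -o=gene1_aligned -codon
```

With PRANK installed, the list can be passed to `subprocess.run(cmd, check=True)`, once per gene.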

MinjieHu commented 2 years ago

Hi tamuri,

Thanks for clarifying. It's clearer to me now.