mrmckain / clone_reducer

Clone reducing script used in Estep et al. (2014).

Do you have any example data to run through this script please #3

Closed baoxingsong closed 5 years ago

mrmckain commented 5 years ago

The data from the Estep et al. paper would be an example.

I use a different version now, where I have altered the code a bit to fit my usual style of naming sequences for phylotranscriptomics (SpeciesID-trinity_name).

What do your sequence IDs look like?

baoxingsong commented 5 years ago

I am performing a similar analysis following the Estep et al. paper. Toby is also involved in this.

My sequence IDs look like this:

sugarcane_Sspon.03G0002050-4D-mRNA-1
sugarcane_Sspon.03G0002050-2B-mRNA-1
msinensis_Misin06G339300.1.v7.1
msinensis_Misin05G367300.1.v7.1
sorghum_KXG33778
maize_Zm00001d012260_T001
maize_Zm00001d042686_T001
setaria_KQL07915
Elionorus_tripsacoides_KXG33778_contig_64824_87351_88914
Elionorus_tripsacoides_KXG33778_contig_98919_20112_21678
Elionorus_tripsacoides_Zm00001d042686_T001_contig_14439_98880_100451
Elionorus_tripsacoides_Zm00001d042686_T001_contig_48493_19595_21163
Hyparrhenia_diplandra_Zm00001d042686_T001_contig_785_137404_138975
Miscanthus_junceus_Zm00001d042686_T001_contig_25598_8405_9975
Miscanthus_junceus_Zm00001d042686_T001_contig_1168_89048_90620

I generated those sequences using my own Python script, so if you could show me an example of your data, I could reformat my IDs into your format.

Could you also please share a phylogenetic tree? I could not figure out which RAxML output file should be used as input for this script.

The publication says, "We used individual gene-tree topologies as a guide to identify and concatenate paralogues from the same genome for each accession." Do you know whether this step was done with a script or manually?

mrmckain commented 5 years ago

Here is an example of what the IDs should look like:

>Acorus_americanus-TRINITY_DN12742_c0_g1_i1
Acorus_americanus-TRINITY_DN12742_c0_g1_i2
Acorus_americanus-TRINITY_DN15327_c0_g1_i1
Acorus_americanus-TRINITY_DN17363_c0_g2_i1
Acorus_americanus-TRINITY_DN17363_c0_g2_i3
Acorus_americanus-TRINITY_DN17985_c0_g2_i1
Acorus_americanus-TRINITY_DN17985_c0_g2_i2
Acorus_americanus-TRINITY_DN8911_c0_g1_i1
Ananas_comosum-Aco006534.1
Ananas_comosum-Aco008271.1
Ananas_comosum-Aco019567.1
Ananas_comosum-Aco020510.1
Asparagus_officinalis-evm.model.AsparagusV1_02.1827
Asparagus_officinalis-evm.model.AsparagusV1_05.2482
Asparagus_officinalis-evm.model.AsparagusV1_05.3504
Chamaedorea_seifrizii-TRINITY_DN13659_c0_g1_i1
Chamaedorea_seifrizii-TRINITY_DN13659_c0_g1_i2
Chamaedorea_seifrizii-TRINITY_DN19697_c0_g1_i1
Chamaedorea_seifrizii-TRINITY_DN19697_c0_g1_i2
Chamaedorea_seifrizii-TRINITY_DN27655_c2_g2_i1
Cocos_nucifera-TRINITY_DN38308_c0_g1_i1
Cocos_nucifera-TRINITY_DN38308_c0_g1_i2
Cocos_nucifera-TRINITY_DN46420_c3_g1_i1
Cocos_nucifera-TRINITY_DN46420_c3_g1_i2
Cocos_nucifera-TRINITY_DN46420_c3_g1_i3
Cocos_nucifera-TRINITY_DN46420_c3_g1_i4
Cocos_nucifera-TRINITY_DN46420_c3_g1_i5
Cocos_nucifera-TRINITY_DN46420_c3_g1_i6
Cocos_nucifera-TRINITY_DN46420_c3_g1_i7
Cocos_nucifera-TRINITY_DN46515_c3_g1_i2
Cocos_nucifera-TRINITY_DN46515_c3_g1_i4
Cocos_nucifera-TRINITY_DN46515_c3_g1_i5
Cocos_nucifera-TRINITY_DN46515_c3_g1_i6
Cocos_nucifera-TRINITY_DN46515_c3_g1_i8
Costus_pulverulentus-TRINITY_DN44990_c0_g1_i2
Costus_pulverulentus-TRINITY_DN44990_c0_g1_i3

All that matters is that there is a "-" between the taxon ID and the rest of the name. If you are using different accessions for a taxon, then have that info before the "-" if you want to treat them separately.
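As a rough illustration of that naming rule, the IDs posted earlier in this thread could be converted along these lines. This is only a sketch: the `TAXA` list and both function names are hypothetical, the prefixes are read off the example IDs above, and it assumes the taxon is everything before the first "-". Whether clone_reducer tolerates additional hyphens later in the name (as in the sugarcane IDs) is not stated here and should be checked.

```python
# Hypothetical taxon prefixes, read off the example IDs in this thread.
TAXA = [
    "sugarcane", "msinensis", "sorghum", "maize", "setaria",
    "Elionorus_tripsacoides", "Hyparrhenia_diplandra", "Miscanthus_junceus",
]


def to_taxon_dash_id(seq_id, taxa=TAXA):
    """Replace the '_' that follows the taxon prefix with '-'."""
    # Try longer prefixes first so a short taxon name cannot shadow a longer one.
    for taxon in sorted(taxa, key=len, reverse=True):
        prefix = taxon + "_"
        if seq_id.startswith(prefix):
            return taxon + "-" + seq_id[len(prefix):]
    raise ValueError(f"no known taxon prefix in {seq_id!r}")


def split_id(seq_id):
    """Split a converted ID at the first '-' into (taxon, rest of name)."""
    taxon, _, rest = seq_id.partition("-")
    return taxon, rest


if __name__ == "__main__":
    for old in ("maize_Zm00001d012260_T001",
                "Elionorus_tripsacoides_KXG33778_contig_64824_87351_88914"):
        new = to_taxon_dash_id(old)
        print(old, "->", new, split_id(new))
```

To keep different accessions of one taxon separate, the accession label would simply be part of the prefix listed in `TAXA`, so that it ends up before the "-".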

For the RAxML output, use the bipartitions file that is produced when you do a bootstrap run.

Deciding which paralogs go together was done manually. We then put them together using a script.