vanheeringen-lab / ANANSE

Prediction of key transcription factors in cell fate determination using enhancer networks. See full ANANSE documentation for detailed installation instructions and usage examples.
http://anansepy.readthedocs.io
MIT License

Ananse network is not using genomepy annotations or genomes #183

Closed. alexgilgal closed this issue 2 years ago

alexgilgal commented 2 years ago

Hi! I have been using ANANSE with a genome other than hg38 or mm10. The genome in question is Astyanax_mexicanus-2.0, which I installed from Ensembl using genomepy. I generated the corresponding motif2factors file with GimmeMotifs so that I could run ananse binding. The ananse binding step completed successfully, and everything looks fine in ananse view.

The problem comes when I run ananse network. I pass the genomepy name of the genome, Astyanax_mexicanus-2.0, and I get this message: Please provide a gene bed file with -a argument. As I understand the documentation, it should take the annotation from genomepy, but it doesn't. When I provide the annotation that is in the genomepy folder, there are conflicts between the GeneIds of the expression files, the TF names (gene names), and the annotation (which is the Ensembl transcript ID). So far, I have been able to put the expression files in the same gene format as the binding file, but I have not managed to convert the annotation to that format. Is there a way to generate this annotation file with genomepy?

These are the program versions I'm using: ananse v0.3.0, GimmeMotifs v0.17.0, genomepy v0.12.0.

I also attach the motif2factors file generated with GimmeMotifs. The species used for inferring orthology are the default ones. The expression files used for the ananse network step use this gene ID format.

Thanks a lot for this tool, it is great!

Astyanax_mexicanus-2.0.gimme.vertebrate.v5.0.motif2factors.txt

simonvh commented 2 years ago

expression files, the TF names (gene names), and the annotation (which is the Ensembl transcript ID)

Can you give an example (first few lines) of these different data?

alexgilgal commented 2 years ago

Sure! Here is a head of these files:

Expression files:

tpm
a1cf    19.830643
aaas    13.544051
aacs    0.904647
aadac   2.018107
aadacl4 8.310408
aadat   7.847417
aagab   12.512527
aak1b   1.719697
aamdc   9.310341

The motif2factors.txt looks like this:

Motif   Factor  Evidence    Curated
GM.5.0.Sox.0001 sox7    Orthologs   N
GM.5.0.Sox.0001 sox17   Orthologs   N
GM.5.0.Sox.0001 sox18   Orthologs   N
GM.5.0.Sox.0001 sox9a   Orthologs   N
GM.5.0.Sox.0001 sox9b   Orthologs   N
GM.5.0.Sox.0001 sox12   Orthologs   N
GM.5.0.Sox.0001 SOX4    Orthologs   N
GM.5.0.Sox.0001 sox4a   Orthologs   N
GM.5.0.Sox.0001 sox13   Orthologs   N
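
Since the question is whether the identifiers in the expression file and in motif2factors.txt agree, a quick diagnostic can help. The following is a minimal sketch, not part of ANANSE: it compares the first column of the expression table with the Factor column, ignoring case. The function names are made up for this example, the "sox" expression rows are hypothetical values added so that something matches, and the case-insensitive comparison is an assumption that says nothing about how ANANSE matches identifiers internally.

```python
import csv
import io

# Expression rows: the first two are copied from the thread; the "sox" rows
# are HYPOTHETICAL values added only so the example has something to match.
EXPRESSION_TSV = (
    "tpm\n"
    "a1cf\t19.830643\n"
    "aaas\t13.544051\n"
    "sox9a\t1.0\n"
    "sox4\t1.0\n"
)

# A few motif2factors rows taken from the thread.
MOTIF2FACTORS_TSV = (
    "Motif\tFactor\tEvidence\tCurated\n"
    "GM.5.0.Sox.0001\tsox7\tOrthologs\tN\n"
    "GM.5.0.Sox.0001\tsox9a\tOrthologs\tN\n"
    "GM.5.0.Sox.0001\tSOX4\tOrthologs\tN\n"
)


def read_expression_ids(handle):
    """Return the identifiers in the first column, skipping the header line."""
    lines = handle.read().splitlines()
    return {line.split("\t")[0] for line in lines[1:] if line.strip()}


def read_factors(handle):
    """Return the set of names in the Factor column of a motif2factors table."""
    return {row["Factor"] for row in csv.DictReader(handle, delimiter="\t")}


def compare_ids(expression_ids, factors):
    """Split factors into those with and without an expression row (case-insensitive)."""
    lowered = {gene.lower() for gene in expression_ids}
    matched = {factor for factor in factors if factor.lower() in lowered}
    return matched, factors - matched


expression_ids = read_expression_ids(io.StringIO(EXPRESSION_TSV))
factors = read_factors(io.StringIO(MOTIF2FACTORS_TSV))
matched, unmatched = compare_ids(expression_ids, factors)
print("factors with an expression row:", sorted(matched))
print("factors without one:", sorted(unmatched))
```

With real files, the two `io.StringIO(...)` arguments would be replaced by open file handles. A long list of unmatched factors would suggest that the two files use different identifier types.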

And finally here is the annotation BED that is inside the genomepy folder:

3   34647819    34675437    ENSAMXT00000018611  0   +   34672181    34675437    0   12  638,188,64,904,85,37,148,132,90,144,185,133,    0,22416,24357,24506,25498,25669,25895,26127,26390,26675,26926,27485,
3   34670084    34676122    ENSAMXT00000030998  0   +   34672181    34675437    0   11  339,64,904,85,37,148,132,90,144,185,818,    0,2092,2241,3233,3404,3630,3862,4125,4410,4661,5220,
3   34663926    34669232    ENSAMXT00000040962  0   -   34663926    34669232    0   8   163,203,104,76,52,40,48,25, 0,280,1352,1554,1714,1929,2054,5281,
3   34676142    34680481    ENSAMXT00000032393  0   -   34676142    34679504    0   7   236,180,130,170,140,94,1035,    0,635,900,1199,1626,1876,3304,
3   34676142    34680481    ENSAMXT00000018591  0   -   34676142    34679504    0   8   236,180,130,170,140,94,62,878,  0,635,900,1199,1626,1876,3304,3461,
3   34685832    34693041    ENSAMXT00000018568  0   -   34688692    34692676    0   2   3672,813,   0,6396,
3   34696413    34739618    ENSAMXT00000039580  0   -   34698935    34739170    0   10  2603,100,56,114,107,109,78,151,58,554,  0,12780,24482,27595,28647,39961,40525,41015,41746,42651,
3   34743500    34751717    ENSAMXT00000055970  0   -   34747139    34751368    0   4   4259,82,84,472, 0,5651,7554,7745,
3   34743500    34751717    ENSAMXT00000055145  0   -   34747139    34751537    0   5   4259,82,84,79,263,  0,5651,7554,7745,7954,
3   34807134    34820122    ENSAMXT00000018553  0   +   34807169    34820122    0   6   224,122,103,69,139,131, 0,10834,11624,11877,12615,12857,
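
The mismatch described here (transcript IDs in the BED name column, gene names everywhere else) can in principle be resolved by rewriting column 4 of the BED with a transcript-to-gene lookup. The sketch below is one way to do that by hand, assuming the lookup is built from the `transcript_id` and `gene_name` attributes of an Ensembl-style GTF. It is not a genomepy or ANANSE function: the helper names are invented, the GTF line and the gene name `hypothetical_gene` are placeholders, and the thread does not confirm that the result is what `ananse network -a` expects.

```python
import re

# Matches GTF attributes of the form: key "value";
ATTRIBUTE = re.compile(r'(\S+) "([^"]*)"')


def transcript_to_gene_from_gtf(gtf_lines):
    """Build {transcript_id: gene_name} from the 'transcript' rows of a GTF."""
    mapping = {}
    for line in gtf_lines:
        if line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9 or fields[2] != "transcript":
            continue
        attrs = dict(ATTRIBUTE.findall(fields[8]))
        if "transcript_id" in attrs and "gene_name" in attrs:
            mapping[attrs["transcript_id"]] = attrs["gene_name"]
    return mapping


def rename_bed_names(bed_lines, mapping):
    """Replace column 4 (name) of each BED row; unknown IDs are left unchanged."""
    renamed = []
    for line in bed_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) >= 4:
            fields[3] = mapping.get(fields[3], fields[3])
        renamed.append("\t".join(fields))
    return renamed


# HYPOTHETICAL GTF row: only the transcript ID comes from the thread.
gtf_lines = [
    '3\tensembl\ttranscript\t34647820\t34675437\t.\t+\t.\t'
    'gene_id "PLACEHOLDER_GENE_ID"; transcript_id "ENSAMXT00000018611"; '
    'gene_name "hypothetical_gene";',
]

# First six columns of two BED rows shown in the thread.
bed_lines = [
    "3\t34647819\t34675437\tENSAMXT00000018611\t0\t+",
    "3\t34670084\t34676122\tENSAMXT00000030998\t0\t+",
]

mapping = transcript_to_gene_from_gtf(gtf_lines)
renamed = rename_bed_names(bed_lines, mapping)
for row in renamed:
    print(row)
```

Rows whose transcript ID has no entry in the lookup keep their original name, which makes leftover transcript IDs easy to spot afterwards. Several transcripts of one gene will end up with the same name, so a real workflow may also need to collapse them to one row per gene.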

I have realized that I'm running the stable version of ANANSE, so I'm giving the develop branch a try. I saw in another issue that someone asked whether ananse can be used with non-model organisms, and that you recommended using the development branch. With the development branch, the gene IDs in the BED and in the expression file are transformed if needed, so I think this solves this particular problem. So far, ananse network is running without problems. Let's see how it goes! Again, thanks for your work!

simonvh commented 2 years ago

Ah yes, the current develop branch is much improved. Hope that works for you; if not, feel free to update and/or reopen the issue :)