vanheeringen-lab / gimmemotifs

Suite of motif tools, including a motif prediction pipeline for ChIP-seq experiments. See full GimmeMotifs documentation for detailed installation instructions and usage examples.
https://gimmemotifs.readthedocs.io/en/master
MIT License
110 stars 33 forks source link

gimme motif2factors: No Orthologs Found for any factor in my custom organism genome #307

Open apposada opened 1 year ago

apposada commented 1 year ago

Hi! I hope all is going good!

I am trying to run gimme motif2factors (v0.17.1, installed through conda/mamba) on the genome of my organism to be able to run ANANSE. I set up a folder for a custom genome in ~/.local/share/genomes/MyOrganism as I have previously done for other organisms on other occasions:

alberto@nostromo:~/.local/share/genomes/Smed$ ls -lha
total 205M
drwxrwxr-x  4 alberto alberto 4.0K Jun 28 11:30 .
drwxrwxr-x 16 alberto alberto 4.0K Jun 27 19:47 ..
-rw-rw-r--  1 alberto alberto 194M Jun  1 13:34 Smed.annotation.bed
lrwxrwxrwx  1 alberto alberto   78 Jun  1 13:10 Smed.annotation.gtf -> ~/projects/smed_cisreg/data/standard_references/schMedS3_h1.gtf
-rw-rw-r--  1 alberto alberto    0 Jun  1 16:47 Smed.blacklist.bed
lrwxrwxrwx  1 alberto alberto   71 Jun  1 13:10 Smed.fa ->  ~/DATA/dynamic/genomes/Smed/Rink/schMedS3/schMedS3_h1.fa
lrwxrwxrwx  1 alberto alberto   71 Jun  1 13:12 Smed.fa.fai -> ~/projects/dev/pipe/scbe_pipe/metadata/schmidtea.fa.fai
-rw-rw-r--  1 alberto alberto  12K Jun  1 13:30 Smed.fa.sizes
-rw-rw-r--  1 alberto alberto 4.6K Jun  1 13:40 Smed.gaps.bed
-rw-rw-r--  1 alberto alberto  12M Jun  1 18:14 Smed.pep.fa
-rw-rw-r--  1 alberto alberto  14K Jun  1 16:47 Smed.seq2scienceblacklist_complement_encode_mito.bed
-rw-rw-r--  1 alberto alberto    0 Jun  1 16:47 Smed.seq2scienceblacklist_encode_mito.bed
drwxrwxr-x  2 alberto alberto 4.0K Jun  1 16:50 genome_sizes
drwxrwxr-x  3 alberto alberto 4.0K Jun  1 16:47 index

In the process of trying to run motif2factors, I am getting the following issue. When trying the following:

gimme motif2factors --new-reference Smed --threads 12 --tmpdir temp --keep-intermediate

I get the following:

19:52:23 | INFO | Making a new reference for: Smed.
19:52:23 | INFO | You say the gimme.vertebrate.v5.0 database is based on: GRCh38.p13 & GRCm38.p6.
19:52:23 | INFO | For better orthology inference we are also using these assemblies: danRer11 & UCB_Xtro_10.0 & GRCg6a & BraLan2 & oryLat2 & ARS-UCD1.2 & phaCin_unsw_v4.1 & rCheMyd1.pri.
19:52:23 | INFO | Using lenient strategy for orthology/name inference.
19:52:23 | INFO | genomes_dir: /mnt/sda/alberto/.local/share/genomes
19:52:23 | INFO | tmpdir: temp
19:52:23 | INFO | outdir: .
19:52:23 | INFO | Downloading all assemblies.
19:52:23 | INFO | BraLan2 was already downloaded, using that version.
19:52:23 | INFO | GRCh38.p13 was already downloaded, using that version.
19:52:23 | INFO | GRCm38.p6 was already downloaded, using that version.
19:52:23 | INFO | found pep.fa for Smed, using that one.
19:52:23 | INFO | danRer11 was already downloaded, using that version.
19:52:23 | INFO | rCheMyd1.pri was already downloaded, using that version.
19:52:23 | INFO | GRCg6a was already downloaded, using that version.
19:52:23 | INFO | phaCin_unsw_v4.1 was already downloaded, using that version.
19:52:23 | INFO | oryLat2 was already downloaded, using that version.
19:52:23 | INFO | UCB_Xtro_10.0 was already downloaded, using that version.
19:52:23 | INFO | ARS-UCD1.2 was already downloaded, using that version.
19:52:23 | INFO | Taking the longest protein per gene per assembly.
19:52:23 | INFO | Processing BraLan2.
19:52:31 | INFO | Processing GRCh38.p13.
19:52:31 | INFO | Processing GRCh38.p13.
19:53:21 | INFO | Processing GRCm38.p6.
19:53:51 | INFO | Processing Smed.
19:53:53 | INFO | Processing danRer11.
19:54:09 | INFO | Processing rCheMyd1.pri.
19:54:30 | INFO | Processing GRCg6a.
19:54:49 | INFO | Processing phaCin_unsw_v4.1.
19:55:15 | INFO | Processing oryLat2.
19:55:20 | INFO | Processing UCB_Xtro_10.0.
19:55:41 | INFO | Processing ARS-UCD1.2.
19:56:07 | INFO | Running orthofinder to find orthologs.
[...]

And Orthofinder runs well, but I get INFO | No orthologs found for [...], for ALL the motifs in the database. Leading to a motif2factors.txt file where the column factor just says NO ORTHOLOGS FOUND. First I tried adding more species from genomepy in case of poor phylogenetic signal to transfer TF annotation to the genome of my organism, but this did not work either (I got the same result).

$ wc -l Smed.gimme.vertebrate.v5.0.motif2factors.txt 
1797 Smed.gimme.vertebrate.v5.0.motif2factors.txt
$ grep "NO ORTHOLOGS FOUND" Smed.gimme.vertebrate.v5.0.motif2factors.txt | wc -l
1796 #first line of the table is column names hence 1 number difference

I have ran motif2factors several times in the past, with other genomes from other organisms, and I have never found this problem. What do you think might be happening here?

Cheers, Alberto

siebrenf commented 1 year ago

Hey Alberto,

gimme motif2factors uses orthofinder to connect genes in the motif database to the gene annotation. It does this based on amino acid sequence similarity. The default motif database is the vertebrate motif database, while your species is invertebrate, which makes it hard to map one to the other.

There are two solutions to this problem. 1) select the invertebrate database 2) add invertebrate assemblies to the orthofinder analysis, to try to bridge the gap between assemblies.

When we last collaborated I went for option 2. Here are my notes from back then:

Invertebrate assemblies with gene annotations

worm model organisms

plenaria model organisms

other invertebrate model organisms

Command

gimme motif2factors \
--new-reference Smed \
--ortholog-references \
GRCz11 GRCg6a UCB_Xtro_10.0 BraLan2 oryLat2 ARS-UCD1.2 phaCin_unsw_v4.1 rCheMyd1.pri \
Mlig_3_7 El_Paco BDGP6.32 Hydra_RP_1.0 WBcel235 KH ASM20922v1 \
--genomes_dir ~/.local/share/genomes \
--tmpdir /path/to/output/folder/tmp \
--outdir /path/to/output/folder \
--threads 48 \
--keep-intermediate

There were some additional problems with the Smed assembly back then, so if you still have those, check the care package I uploaded back them. You can email be at siebrenf@science.ru.nl if you lost the link :)

siebrenf commented 1 year ago

if these are the same organisms you tried earlier, you could also try checking your orthofinder version. Seems they updated recently, so perhaps that changed the output?

apposada commented 1 year ago

Hi Siebren,

Thank you for your message! These were not the species I used (nor the motif DB was the same) when I tried earlier. I will try with these parameters you suggested right away, and will let you know.

I do still have the care package that you sent us :) I managed to re-run it on a previous iteration of the genome to check how it worked on my hands and all went good; it was an amazing feat of work! I will be relying on it for this version of the genome too; I will keep you updated on all of this through email soon.

Cheers, Alberto

apposada commented 1 year ago

Hi again,

I left it running yesterday with the same species you suggested (except Hydra, because genomepy kept telling me no annotation was available). I have just checked the output and I still get a motif2factors.txt without any transferred factor (i.e. NO ORTHOLOGS FOUND for all factors). I have checked Orthofinder and I have version 2.5.4 on my system. Could it be this?

Cheers, Alberto

siebrenf commented 1 year ago

During our colab, the Smed annotation did not have start and stop codons annotated which was causing issues. My hypothesis is therefore that there is something wrong/missing with your new annotation as well.

Is ~/.local/share/genomes/Smed/Smed.pep.fa empty perhaps? If you so can make one yourself. Here was how I did that last time:

1: get all CDS:

btw, update the paths to fit your folder structure

bedtools getfasta \
-fi /bank/genomes/Smed/Smed.fa \
-bed /bank/genomes/Smed/Smed.annotation.bed \
-nameOnly -s -split \
-fo /bank/genomes/Smed/Smed.pep.CDS

2: convert each transcript to peptide

from Bio.Seq import translate
import re

cdsfile = "/bank/genomes/Smed/Smed.pep.CDS"  # all exons per gene
pepfile = "/bank/genomes/Smed/Smed.pep_all.fa"  # peptides for all transcripts

# NCBI has no planarian nuclear codon table
# https://www.ncbi.nlm.nih.gov/Taxonomy/Utils/wprintgc.cgi

n = 0
try:

    with open(cdsfile) as cds, open(pepfile, "w") as pep:
        for line in cds:
            if line.startswith(">"):
                # remove strand information from gene "name"
                line = re.sub(r"\([+-\.]\)", "", line)
            else:
                # convert NUC to AA
                line = line.strip()
                # for each frame, get all ORFs, select the longest
                # (without it, the "peptides" are clearly borked)
                f1 = translate(line)
                f2 = translate(line[1:])
                f3 = translate(line[2:])
                orfs = f1.split("*") + f2.split("*") + f3.split("*")
                line = max(orfs, key=len)
                line += "\n"
            pep.write(line)
            n += 1

except Exception as e:
    print(f"wrote {n} lines")
    raise e

3: take the longest peptide per gene

import pyfaidx
import genomepy

# filter for the longest transcript per gene
infile = "/bank/genomes/Smed/Smed.pep_all.fa"  # peptides for all transcripts
outfile = "/bank/genomes/Smed/Smed.pep.fa"  # peptides for the longest transcripts

a = genomepy.Annotation("/bank/genomes/Smed")
g2t = a.gtf_dict("gene_name", "transcript_id")
pep = pyfaidx.Fasta(infile)
with open(outfile, "w") as out:
    for gene in g2t:
        transcripts = {}
        for tid in g2t[gene]:
            transcripts[pep[tid][:].seq] = tid
        seq = max(transcripts.keys(), key=len)
        header = transcripts[seq]
        out.write(f">{header}\n{seq}\n")

the output Smed.pep.fa file is located inside the genomepy Smed assembly folder. Once there, you can rerun gimme motif2factors and it will use the file as-is, instead of generating its own.

apposada commented 1 year ago

Hi Siebren,

Thanks for your detailed help!

It may well be that something is off/different between annotation versions, but an empty pep.fa is not the case as I generated my pep.fa using transdecoder. I did this following standard guidelines as found in transdecoder's documentation; see also https://github.com/scbe-lab/pristina-cell-type-atlas/blob/main/pristina_transcriptomic_landscape/code/R_markdowns/00_TF_annotation/00_TF_annotation.Rmd and https://github.com/apposada/gene_annot for examples of how I have done it in recent, previous cases. By all of this I mean that the pep.fa is not empty and contains protein sequences. I could send you the pep.fa if this could help troubleshooting. Could it be related to some discrepancy between gene names/transcript names in the gtf and the peptide names in the pep.fa? As far as I know the sequences in my pep.fa have the same name as the transcript they were retrieved from (meaning: not the ID of the "gene" feature in the GTF, but the "transcript" feature in the GTF):

$ grep ">" ~/.local/share/genomes/Smed/Smed.pep.fa | wc -l
26894
$ head ~/.local/share/genomes/Smed/Smed.pep.fa
>h1SMcT0000001.1
NNMITLNFSTNRNKNKMNIMCRLCHKEPETQAHIFNHCTQTQNARRNKHNNVMKTTSDYLISKGFSVDIECTPQGIDTRLRPDLIIKSKRNKKLHILDIKVPYDREESFNAAREDNYKKYRELALEMGKTNACKTTLSALVVGTLGSWDKGNDKALDQIGLNKIEQKKLAKLCTTTAVISSYTIYIDHITNRPKQTQ
>h1SMcT0000002.1
VENIIFRYGTPKKLLTDQGKNFQSEMFKSITDLFGIWKLNTSPYHPQTDGLVERFNGTLI
NMISSYVNRKQKDWDDFIKPCLFAYRNAVNQSTGETPFYLMYLRKNNMPIDLKFQPPISQYMEIPDYKALMQEKMQSVWTKAGLTIKHNQEKYKETYDKKAHAHGIKIGDLVYLNRPKPLKGLSPKLQRPFKGPYKVCDTTETNLKLLPFRSKKANPMIVHANRCKLVPEEDESRYPLRS
$ grep h1SMcT0000002.1 ~/.local/share/genomes/Smed/Smed.annotation.gtf
chr1_h1 ONThybrid       transcript      79507   82094   .       -       .       ID "h1SMcT0000002.1"; Parent "h1SMcG0000002"; Name "h1SMcT0000002.1"; oRF_Len "641"; oRF_Type "5prime_partial"; gene_id "h1SMcG0000002"; transcript_id "h1SMcT0000002.1"; gene_name "h1SMcG0000002"; transcript_name "h1SMcT0000002.1"
chr1_h1 ONThybrid       exon    79507   82094   .       -       .       ID "exon-3"; Parent "h1SMcT0000002.1"; gene_id "h1SMcG0000002"; transcript_id "h1SMcT0000002.1"; gene_name "h1SMcG0000002"; transcript_name  h1SMcT0000002.1"
chr1_h1 ONThybrid       CDS     80170   82092   .       -       0       ID "cds-2"; Parent "h1SMcT0000002.1"; gene_id "h1SMcG0000002"; transcript_id "h1SMcT0000002.1"; gene_name "h1SMcG0000002"; transcript_name "h1SMcT0000002.1"
chr1_h1 ONThybrid       five_prime_UTR  82093   82094   .       -       .       ID "nbis-five_prime_utr-2"; Parent "h1SMcT0000002.1"; gene_id "h1SMcG0000002"; transcript_id "h1SMcT0000002.1"; gene_name "h1SMcG0000002"; transcript_name "h1SMcT0000002.1"
chr1_h1 ONThybrid       three_prime_UTR 79507   80169   .       -       .       ID "nbis-three_prime_utr-2"; Parent "h1SMcT0000002.1"; gene_id "h1SMcG0000002"; transcript_id "h1SMcT0000002.1"; gene_name "h1SMcG0000002"; transcript_name "h1SMcT0000002.1"

By the way, I did not say earlier because I thought there was no point, but motif2factors also failed when I only provided a (non-empty, healthy) pep.fa in the genomes directory (no .fa, no .gtf, no nothing except the pep.fa in the genome directory). This is more a curiosity: does motif2factors need the rest of the genome files for anything? Last time I tried a motif2factors run (ca. end of 2021) it did not. Regarding stop codons, do you mean some kind of missing feature in the GTF file, or something in the pep.fa file?

As a temporary measure to move on with setting up all the analyses, I have updated your previous motif2factors.txt file (the one you provided for us) with the new ids of the equivalent genes between assembly versions. This gives me ~970 TFs (way less than the 1900+ TFs you inferred with the previous genome version, but this will do in the time in between). Meaning: I am out of ideas of the reason behind and it would be great to figure out what is happening but it is not entirely crucial at this point. I would not like you to feel rushed or anything like that at all.

Let me know if I can help providing any more info (regarding the GTF, the pep.fa, etc...). Thanks,

Alberto