Closed by KBeine 5 months ago
Yes, it is possible to make a riparian plot for whole-genome alignments (WGAs), just not with GENESPACE. We are currently trying to determine the best way to release this functionality: it may be added to GENESPACE or released as a separate package. Whichever release method we choose, it won't be public until fall 2024 at the earliest. If your research absolutely needs this functionality sooner than that, email me and we can figure out a path forward.
I have a separate issue now. I used fa (peptide) files instead of fasta (DNA) files. The parse function didn't work with my gff and fa files to create the cleaned peptide files and the bed files, so I created them myself, as suggested in another issue. I had to do a bit of matching between the bed and peptide files to get gpar <- init_genespace(wd = wd, path2mcscanx = path2mcscanx) to work. I am now stuck on out <- run_genespace(gpar, overwrite = T):
gpar <- init_genespace(
  wd = wd,
  path2mcscanx = path2mcscanx)
Checking Working Directory ... PASS: /home/user/Documents/UBUNTU/Alignment
Checking user-defined parameters ...
Genome IDs & ploidy ...
alitta : 1
Hdiv : 1
platynereis: 1
Outgroup ... NONE
n. parallel processes ... 8
collinear block size ... 5
collinear block search radius ... 25
n gaps in collinear block ... 5
synteny buffer size... 100
only orthogroups hits as anchors ... TRUE
n secondary hits ... 0
Checking annotation files (.bed and peptide .fa):
alitta : 246 / 246 geneIDs exactly match (PASS)
Hdiv : 221 / 221 geneIDs exactly match (PASS)
platynereis: 239 / 239 geneIDs exactly match (PASS)
Checking dependencies ...
Found valid path to OrthoFinder v2.55:orthofinder
Found valid path to DIAMOND2 v2.19:diamond
Found valid MCScanX_h executable:/home/user/Apps/MCScanX/MCScanX_h
GENESPACE prints a ton of output. We don’t provide a mechanism to suppress this because it is crucial to make sure things are looking ok. There are some descriptions about the text printed to the console (see below output …)
out <- run_genespace(gpar, overwrite = T)
############################
Running orthofinder (or parsing existing results)
Checking for existing orthofinder results ...
[1] TRUE
Copying files over to the temporary directory: /home/user/Documents/UBUNTU/Alignment/tmp
Running the following command in the shell:
orthofinder -f /home/user/Documents/UBUNTU/Alignment/tmp -t 8 -a 1 -X -o /home/user/Documents/UBUNTU/Alignment/orthofinder
This can take a while. To check the progress, look in the WorkingDirectory in the output (-o) directory
OrthoFinder version 2.5.5 Copyright (C) 2014 David Emms
2024-04-25 18:50:02 : Starting OrthoFinder 2.5.5
8 thread(s) for highly parallel tasks (BLAST searches etc.)
1 thread(s) for OrthoFinder algorithm
Test can run "mcl -h" - ok
Test can run "fastme -i /home/user/Documents/UBUNTU/Alignment/orthofinder/Results_Apr25/WorkingDirectory/dependencies/SimpleTest.phy -o /home/user/Documents/UBUNTU/Alignment/orthofinder/Results_Apr25/WorkingDirectory/dependencies/SimpleTest.tre" - ok
2024-04-25 18:50:02 : Creating diamond database 1 of 3
2024-04-25 18:50:02 : Creating diamond database 2 of 3
2024-04-25 18:50:02 : Creating diamond database 3 of 3
Using 8 thread(s)
2024-04-25 18:50:02 : This may take some time....
2024-04-25 18:50:02 : Done 0 of 9
2024-04-25 18:50:06 : Done all-versus-all sequence search
2024-04-25 18:50:06 : Initial processing of each species
2024-04-25 18:50:06 : Initial processing of species 0 complete
2024-04-25 18:50:06 : Initial processing of species 1 complete
2024-04-25 18:50:06 : Initial processing of species 2 complete
2024-04-25 18:50:08 : Connected putative homologues
2024-04-25 18:50:08 : Written final scores for species 0 to graph file
2024-04-25 18:50:08 : Written final scores for species 1 to graph file
2024-04-25 18:50:08 : Written final scores for species 2 to graph file
2024-04-25 18:50:09 : Ran MCL
OrthoFinder assigned 697 genes (98.7% of total) to 224 orthogroups. Fifty percent of all genes were in orthogroups with 3 or more genes (G50 was 3) and were contained in the largest 99 orthogroups (O50 was 99). There were 192 orthogroups with all species present and 184 of these consisted entirely of single-copy genes.
2024-04-25 18:50:09 : Done orthogroups
2024-04-25 18:50:11 : Done
2024-04-25 18:50:12 : Done 0 of 9
2024-04-25 18:50:13 : Starting STRIDE
2024-04-25 18:50:13 : Done STRIDE
Observed 0 well-supported, non-terminal duplications. 0 support the best roots and 0 contradict them.
Best outgroups for species tree:
  Hdiv
  alitta
  platynereis
WARNING: Multiple potential species tree roots were identified, only one will be analyed.
Outgroup: Hdiv
2024-04-25 18:50:13 : Starting Recon and orthologues
2024-04-25 18:50:13 : Starting OF Orthologues
2024-04-25 18:50:13 : Done 0 of 9
2024-04-25 18:50:13 : Done OF Orthologues
2024-04-25 18:50:13 : Done orthologues
Results: /home/user/Documents/UBUNTU/Alignment/orthofinder/Results_Apr25/
CITATION: When publishing work that uses OrthoFinder please cite: Emms D.M. & Kelly S. (2019), Genome Biology 20:238
If you use the species tree in your work then please also cite:
Emms D.M. & Kelly S. (2017), MBE 34(12): 3267-3278
Emms D.M. & Kelly S. (2018), bioRxiv https://doi.org/10.1101/267914
re-ordering genomeIDs by the species tree: alitta, platynereis, Hdiv
############################
############################
...platynereis v. alitta: total hits = 310, same og = 237
...Hdiv v. platynereis: total hits = 283, same og = 213
...alitta v. Hdiv: total hits = 284, same og = 217
...platynereis v. platynereis: total hits = 319, same og = 243
...alitta v. alitta: total hits = 321, same og = 248
...Hdiv v. Hdiv: total hits = 275, same og = 223
##############
Generating dotplots for all hits ... Done!
############################
...platynereis v. alitta: 123 hits (60 anchors) in 5 blocks (0 SVs, 5 regions)
...alitta v. Hdiv: 150 hits (75 anchors) in 17 blocks (12 SVs, 5 regions)
...Hdiv v. platynereis: 115 hits (52 anchors) in 6 blocks (1 SVs, 5 regions)
...alitta v. alitta: 255 hits (246 anchors) in 14 blocks (0 SVs, 0 regions)
...platynereis v. platynereis: 240 hits (239 anchors) in 46 blocks (0 SVs, 0 regions)
...Hdiv v. Hdiv: 227 hits (221 anchors) in 14 blocks (0 SVs, 0 regions)
############################
############################
############################
Error in `[.data.table`(bedm, , `:=`(reford, (1:.N)/.N), by = "genome") :
  Supplied 2 items to be assigned to group 1 of size 0 in column 'reford'. The RHS length must either be 1 (single values are ok) or match the LHS length exactly. If you wish to 'recycle' the RHS please use rep() explicitly to make this intent clear to readers of your code.
In addition: Warning messages:
1: In mclapply(1:nrow(chnk), mc.cores = nCores, function(i) { :
all scheduled cores encountered errors in user code
2: In mclapply(1:nrow(chnk), mc.cores = nCores, function(i) { :
all scheduled cores encountered errors in user code
This doesn't look like a eukaryote ... not nearly enough genes.
alitta : 246 / 246 geneIDs exactly match (PASS)
Hdiv : 221 / 221 geneIDs exactly match (PASS)
platynereis: 239 / 239 geneIDs exactly match (PASS)
There are a number of warnings about these genomes too ... my guess is there is something up with your genomes.
I used the alitta and platynereis genomes from NCBI; neither of them had gff files. Hdiv is our own genome, which we annotated ourselves. I ran BUSCO with: busco -i genome -o busco_output -l eukaryota_odb10 -m genome -e 1e-06 --limit 1. I then created the bed file from full_table.tsv and concatenated the .faa files from single_copy_busco_sequences. BUSCO was matching against 255 genes, and all three species had above a 90% match for the BUSCO genes. Will this be too few genes to map with GENESPACE?
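For readers following along, the bed-file step described above could look roughly like this Python sketch. It assumes the usual BUSCO full_table.tsv layout (comment lines starting with #, then tab-separated Busco id, Status, Sequence, Gene Start, Gene End, ...), that the Sequence column holds the plain chromosome name, and that the target is a four-column chromosome/start/end/gene-ID bed file. The function names and the choice to keep only Complete hits are illustrative, not part of BUSCO or GENESPACE.

```python
def busco_table_to_bed(table_text, keep_status=("Complete",)):
    """Turn the text of a BUSCO full_table.tsv into sorted 4-column BED rows.

    Assumed columns (tab-separated; lines starting with '#' are comments):
    Busco id, Status, Sequence, Gene Start, Gene End, Strand, ...
    """
    rows = []
    for line in table_text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.split("\t")
        # Rows without coordinates (e.g. "Missing") have too few columns.
        if len(fields) < 5 or fields[1] not in keep_status:
            continue
        busco_id, _status, chrom, start, end = fields[:5]
        lo, hi = sorted((int(start), int(end)))
        rows.append((chrom, lo, hi, busco_id))
    # Sort by chromosome, then start position.
    rows.sort(key=lambda r: (r[0], r[1]))
    return rows


def format_bed(rows):
    """Render BED rows as tab-separated text, one gene per line."""
    return "\n".join("\t".join(str(v) for v in row) for row in rows)
```

The gene IDs written here are the BUSCO IDs, so the headers in the concatenated peptide file would need to be renamed to match them exactly.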
Yeah, this is too few. You definitely don't want to use only BUSCO genes. Just do a full annotation (or liftover) and use that. Hundreds of genes per chromosome is a good starting point. With fewer than that, it can be challenging to find synteny except in very closely related genomes.
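A quick way to see how a bed file measures up against that rule of thumb is to count genes per chromosome. The Python sketch below assumes a tab-separated bed file whose first column is the chromosome and which has at least four columns. The cutoff of 100 is only an illustrative reading of "hundreds of genes per chromosome", not a threshold GENESPACE enforces, and both function names are hypothetical.

```python
from collections import Counter


def genes_per_chromosome(bed_text):
    """Count BED rows (genes) per chromosome from tab-separated BED text."""
    counts = Counter()
    for line in bed_text.splitlines():
        fields = line.split("\t")
        if len(fields) >= 4:  # chrom, start, end, gene ID
            counts[fields[0]] += 1
    return counts


def sparse_chromosomes(counts, min_genes=100):
    """List chromosomes with fewer than min_genes genes (illustrative cutoff)."""
    return sorted(chrom for chrom, n in counts.items() if n < min_genes)
```

With roughly 250 BUSCO genes spread over a whole genome, most or all chromosomes would be flagged by a check like this.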
Hi all, is it possible to create a riparian plot from DNA sequences? I want to create a synteny plot for genomes from worm species. However, the worm species don't have their own gff files, so I put the genomes through BUSCO and have ~255 genes for the 4 worm species, with the fasta and gff files output by BUSCO. I concatenated all the files for each species, so I am left with only 2 files (fasta and gff) for each of them. But the data is DNA, not protein. Can I still use GENESPACE? I got GENESPACE to work with the example data, but when I try to run my data it tells me
Error: subscript contains invalid names
when trying the ParsedPaths command. My code for this is:
ParsedPaths <- parse_annotation(
  rawGenomeRepo = genomeRepo,
  genomeDirs = genomes2run,
  genomeIDs = genomes2run,
  gffString = "gff",
  faString = "fasta",
  presets = 'ncbi',
  genespaceWd = wd)
If it's relevant, my data looks like this:
$ head Alitta_busco.fasta
>1001705at2759_29159_0:0046be|OW028576.1:45821503-45827982
caggATTTTGATGGAGCCCTGGACTGCGTGAGCATTGCAGTAACCTGTGCATTCAACCGA
CATGGCACACTTCTCGCAGTTGGGTGTAATGATGGCCGAATTGTCGTTTGGGATTTTCTC
ACTAGGGGAATAGCAAAAGTCATCAGTGCCCATGTTCATCCTGTGTGTAGCATCAGCTGG
AGTCGCAATGGTAAAAAACTGCTGAGTGCTGCGACGGACAACACTGTCTCTATTTGGGAC
ATCATGACGGGAGAATGTGACAAAACGTTCCGATTCCCTTCTCCCGTTCTGAAGGTCCAG
TTCAATCCTCGAGACAGCAAAATGTTCATCATTTGTCCGATGAAACATGCGGCGGTCGTG
ATGAGCTTAGACGGATCATATACGACTATACCTCTGGATGATGATTCTGACCTGAATGTT
GTGGCGGCGTTCGACCGACAAGGAGACCACATATACACAGGCAATGCTAAGGGCAAGGTT
TGTGCTTatAAAACGAAGAGCCTAGATGTTGTAGCGTCTTTCAAAGTCACAACTGGAACA
$ head -5 Alitta_busco.gff
OW028573.1 MetaEuk gene 46740364 46740573 44 - . Target_ID=1001705at2759_1155016_0:002f22;TCS_ID=1001705at2759_1155016_0:002f22|OW028573.1|-|46740363
OW028573.1 MetaEuk mRNA 46740364 46740573 44 - . Target_ID=1001705at2759_1155016_0:002f22;TCS_ID=1001705at2759_1155016_0:002f22|OW028573.1|-|46740363_mRNA;Parent=1001705at2759_1155016_0:002f22|OW028573.1|-|46740363
OW028573.1 MetaEuk exon 46740364 46740573 44 - . Target_ID=1001705at2759_1155016_0:002f22;TCS_ID=1001705at2759_1155016_0:002f22|OW028573.1|-|46740363_exon_0;Parent=1001705at2759_1155016_0:002f22|OW028573.1|-|46740363_mRNA
OW028573.1 MetaEuk CDS 46740364 46740573 44 - . Target_ID=1001705at2759_1155016_0:002f22;TCS_ID=1001705at2759_1155016_0:002f22|OW028573.1|-|46740363_CDS_0;Parent=1001705at2759_1155016_0:002f22|OW028573.1|-|46740363_exon_0
OW028576.1 MetaEuk gene 45821504 45827983 692 - . Target_ID=1001705at2759_29159_0:0046be;TCS_ID=1001705at2759_29159_0:0046be|OW028576.1|-|45821503
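Because GENESPACE's own checks refer to "peptide .fa" files, it can help to confirm whether a FASTA like the one above holds DNA or protein before going further. The Python sketch below is a rough heuristic only: it treats a sequence as nucleotide when at least 90% of its characters are A, C, G, T, U or N. The function names and the 0.9 cutoff are assumptions made for illustration.

```python
def parse_fasta(text):
    """Parse FASTA text into {record_id: sequence}; the ID is the first word of the header."""
    records = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:].split()[0]
            records[name] = []
        elif name is not None:
            records[name].append(line)
    return {key: "".join(parts) for key, parts in records.items()}


def looks_like_dna(seq, threshold=0.9):
    """Heuristic: True if at least `threshold` of the characters are nucleotide codes."""
    seq = seq.upper()
    if not seq:
        return False
    nucleotide_chars = sum(seq.count(base) for base in "ACGTUN")
    return nucleotide_chars / len(seq) >= threshold
```

Run on the Alitta_busco.fasta excerpt, a check like this would report DNA, which is consistent with the question: the sequences would need to be replaced with protein sequences before GENESPACE can use them.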