Open thomasp85 opened 9 years ago
Regarding the .fasta files: I have some bacterial genomes in .seq, .fasta (sequence information only), and .gbk formats. How can I convert these files into the .fasta format used in extdata (like >gi|71851486|gb|AE017243.1|_1 # 207 # 395 # 1 # ID=1_1;partial=00;start_type=ATG;rbs_motif=AGGA/GGAG/GAGG;rbs_spacer=11-12bp;gc_cont=0.275 MQTNKNNLKVRTQQIRQQIENLLNDRMLYNNFFSTIYVLNETETEIIIDFTDLIAKQEVISR* ) in order to use FindMyFriends?
I'm not quite sure if I understand your question. Do you have sequence information in multiple different formats? Generally you should try to have your sequences annotated by the same algorithm to avoid differences in gene detection bias...
I get it. I should first use prodigal to predict the protein-coding genes of the prokaryotic genomes.
Yep - or glimmer, or something else... Currently, automatic location detection is only supported for files created by prodigal, but there is a fork where I'm working on a gff parser that should be more broadly applicable...
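To make the "location detection" point concrete, here is a minimal sketch of the information carried by a prodigal-style header such as the one quoted earlier in this thread. It is written in Python purely as an illustration and is not the parser FindMyFriends uses; the function name `parse_prodigal_header` and the returned field names are hypothetical.

```python
def parse_prodigal_header(header: str) -> dict:
    """Split a prodigal-style protein FASTA header into its location fields.

    Expected layout: >name # start # end # strand # key=value;key=value;...
    """
    fields = [field.strip() for field in header.lstrip(">").split(" # ")]
    if len(fields) != 5:
        raise ValueError(f"not a prodigal-style header: {header!r}")
    name, start, end, strand, attrs = fields
    # The trailing field is a semicolon-separated list of key=value pairs.
    attributes = dict(item.split("=", 1) for item in attrs.split(";") if item)
    return {
        "name": name,
        # prodigal appends "_<gene number>" to the contig identifier.
        "contig": name.rsplit("_", 1)[0],
        "start": int(start),
        "end": int(end),
        "strand": int(strand),  # 1 = forward strand, -1 = reverse strand
        "attributes": attributes,
    }


header = (
    ">gi|71851486|gb|AE017243.1|_1 # 207 # 395 # 1 # "
    "ID=1_1;partial=00;start_type=ATG;rbs_motif=AGGA/GGAG/GAGG;"
    "rbs_spacer=11-12bp;gc_cont=0.275"
)
gene = parse_prodigal_header(header)
print(gene["contig"], gene["start"], gene["end"], gene["strand"])
```

Headers that only contain a sequence name (as in plain .fasta or .seq files) carry none of these fields, which is why the genes need to be predicted by an annotation tool first.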
There are 260779 genes from 39 organisms. How long would it take to run
"mycoSim <- kmerSimilarity(mycoPan, lowerLimit=0.8, rescale=FALSE)"?
A lot of things factor in. First off, I don't know your computer hardware. Second, kmerSimilarity is the least advisable approach to calculating pangenomes in FindMyFriends, as it is the most computationally heavy. If you have installed the development version (which I'd advise, as it contains numerous improvements), then use the cdhitGrouping function followed by neighborhoodSplit. This way I've successfully calculated pangenomes from thousands of genomes within a day...
Those are some big genomes you're working with, btw... ~6,500 genes each
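For a rough sense of scale, the numbers quoted above can be worked through in a few lines. This is a back-of-the-envelope sketch in Python, and it assumes a naive all-against-all comparison of genes; the actual work done by kmerSimilarity depends on its implementation and on the lowerLimit cutoff, so treat the pair count as an upper bound rather than a runtime prediction.

```python
# Figures quoted in the question above.
n_genes = 260_779
n_organisms = 39

# Average genome size in genes.
genes_per_organism = n_genes / n_organisms

# Number of unordered gene pairs in an all-against-all comparison
# (assumption: every gene is compared with every other gene once).
n_pairs = n_genes * (n_genes - 1) // 2

print(f"{genes_per_organism:.0f} genes per organism on average")
print(f"{n_pairs:,} unordered gene pairs")
```

The pair count grows quadratically with the number of genes, which is the general reason an exhaustive similarity calculation becomes the slowest option as more genomes are added.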
As all this doesn't concern the issue of adding an export API I would prefer if you opened a new issue for further questions (which you are welcome to do - just trying to keep issues separated)
Possible formats: