Support for custom database

joshamilton commented 5 years ago

ICRA Team,

I'm interested in deploying ICRA using a database containing my own strains. Does ICRA provide the ability to specify my own database for mapping? If so, can you provide some guidance on how to create it? Thanks.

talkr commented 5 years ago

We will add this over the next few months. In the meantime you can go over the code and see how the current database is being used. Some of the things you need to do:

Make some informed decision about the level of similarity between strains in your database, cluster and pick representatives (info on how we did it is in the paper's methods).
Concat contigs of each genome.
Create a gem index and a few other structures.
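The "concat contigs of each genome" step could be sketched roughly as below. This is a minimal illustration, not SGVFinder's actual build code: the function names `read_fasta` and `concat_genome`, the `genome_id.contig` key format, and the `(genome_id, offset)` value layout are all assumptions made for the example, and the real database files may be structured differently.

```python
from typing import Dict, Iterable, List, Tuple


def read_fasta(lines: Iterable[str]) -> List[Tuple[str, str]]:
    """Parse FASTA lines into (contig_name, sequence) pairs, in file order."""
    records: List[Tuple[str, str]] = []
    name = None
    chunks: List[str] = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if any.
            if name is not None:
                records.append((name, "".join(chunks)))
            header = line[1:].split()
            name = header[0] if header else ""
            chunks = []
        elif name is not None:
            chunks.append(line)
    if name is not None:
        records.append((name, "".join(chunks)))
    return records


def concat_genome(
    genome_id: str, records: List[Tuple[str, str]]
) -> Tuple[str, Dict[str, Tuple[str, int]]]:
    """Join contigs into one sequence and record where each contig starts.

    Hypothetical layout: keys are 'genome_id.contig', values are
    (genome_id, 0-based offset of the contig in the joined sequence).
    """
    offsets: Dict[str, Tuple[str, int]] = {}
    parts: List[str] = []
    pos = 0
    for contig, seq in records:
        offsets[f"{genome_id}.{contig}"] = (genome_id, pos)
        parts.append(seq)
        pos += len(seq)
    return "".join(parts), offsets
```

For a genome stored in `genome.fasta`, something like `concat_genome("1238190.PRJNA175943", read_fasta(open("genome.fasta")))` would return the joined sequence plus a per-contig offset table. Whether the real index also needs padding between contigs or a different coordinate convention is not stated in this thread.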

wjj666 commented 3 years ago

Hi talkr, could you please explain in more detail how to concatenate the contigs of each genome and create the .if_ujsn, .dests and .idx files? Thanks.

talkorem commented 3 years ago

Send me an e-mail and I'll try to help.

wjj666 commented 3 years ago

Hi,

I have run SGVFinder successfully with the DataFiles mentioned in your GitHub repository. Now I want to call SVs against my own reference genome database (I have already clustered the genomes and picked representatives), which includes 5,000+ genomes (each genome is a FASTA file). After reviewing the files in the DataFiles folder, I have the following questions and hope to get your help. I'd appreciate it if you could give me clearer information about how to create these files from my own reference genomes.

For the file "representatives.contigs.drepped.lens": The first contig name of the genome PRJNA175943 is "AMQY01000001", so why does the file contain “1238190.PRJNA175943.AMQY01000000 26”? Is some operation applied to the beginning of each genome's FASTA file?
For the file "representatives.contigs.drepped.idx": I know "1238190" is the taxonomy ID, "PRJNA175943" is the genome name, and "AMQY01000026" is the contig name. But I don't know what numbers such as "7567701151" for "1238190.PRJNA175943.AMQY01000026" mean.
For the file "representatives.contigs.drepped.if_ujsn": This file contains a lot of sequences. I don't know how to create this file, or why the first sequence is:
For the file "representatives.contigs.drepped.dests": For example, in the first line of this figure: 1238190.PRJNA175943.AMQY01000026 ('1238190.PRJNA175943', 135600), I don't know what 135600 means.

Thank you very much.

talkorem commented 3 years ago

Send me an e-mail and I'll try to help.

lindan1128 commented 2 years ago

Hello,

I have the same issue: I also want to create my own database. Do you know how to create it?

Thank you so much!

fjw536 commented 2 years ago

Hi, I wonder whether you have updated the code to support mapping against custom databases? Thanks.

segalab / SGVFinder

Support for custom database #11