rmenegaux / fastDNA

print-word-vectors on kallisto branch #4

Open manock opened 4 years ago

manock commented 4 years ago

Hi,

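I am trying your software on the kallisto branch (which seems very promising) and have a couple of questions:

At the moment, the need to build a kallisto index and load it into memory for training can (very) quickly become unusable for larger DBs due to RAM limitations. Do you have any plans to improve or change that part?

Whereas most of the fastDNA methods take a loadIndex parameter (on the kallisto branch), print-word-vectors does not. I just want to make sure that the embeddings output by this method are the contig embeddings presented in the related paper.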

Thanks

rmenegaux commented 4 years ago

Hello @manock,

Those are two very good points! The kallisto index does become very large for larger DBs. In fact, on the large dataset of the paper I was not able to build the index for 17- and 19-mers. Did you manage to build one for k=31? The finished index fits in memory, but RAM overflows while building it. One solution might be to build several de Bruijn graphs on chunks of the data and then merge them. I am currently trying this out with larger datasets; if I find a solution I will let you know.
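
As a toy illustration of the chunk-and-merge idea only (this is not how kallisto builds its index, which uses a compacted graph and its own data structures), here is a minimal Python sketch. The function names `build_graph` and `merge_graphs` are hypothetical, the graph is a plain adjacency map from each (k-1)-mer to its successors, and reverse complements are ignored for simplicity.

```python
from collections import defaultdict


def build_graph(sequences, k):
    """Build a toy de Bruijn graph: (k-1)-mer prefix -> set of (k-1)-mer suffixes."""
    graph = defaultdict(set)
    for seq in sequences:
        for i in range(len(seq) - k + 1):
            kmer = seq[i:i + k]
            # Each k-mer is an edge from its prefix to its suffix.
            graph[kmer[:-1]].add(kmer[1:])
    return graph


def merge_graphs(graphs):
    """Merge per-chunk graphs by taking the union of their edges."""
    merged = defaultdict(set)
    for graph in graphs:
        for node, successors in graph.items():
            merged[node] |= successors
    return merged


# Hypothetical example data: two chunks built separately, then merged.
sequences = ["ACGTAC", "GTACGG", "TTACGT"]
whole = build_graph(sequences, 3)
merged = merge_graphs([build_graph(sequences[:2], 3), build_graph(sequences[2:], 3)])
```

Because merging is just a union of edges, the merged graph is identical to the one built from all sequences at once. The merged result still has to fit in memory, so in this toy version only the per-chunk construction cost is bounded.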

Indeed the print-word-vectors function is not yet implemented on the kallisto branch. I will try to push something in the next couple of days.

Thank you for the feedback,

Romain

On 24 Apr 2020, at 14:47, manock notifications@github.com wrote:

> Hi,

> I am trying your software on the kallisto branch (which seems very promising) and have a couple of questions:

> At the moment, the need to build a kallisto index and load it into memory for training can (very) quickly become unusable for larger DBs due to RAM limitations. Do you have any plans to improve or change that part?

> Whereas most of the fastDNA methods take a loadIndex parameter (on the kallisto branch), print-word-vectors does not. I just want to make sure that the embeddings output by this method are the contig embeddings presented in the related paper.

> Thanks

manock commented 4 years ago

Hi,

> Did you manage to build one for k=31?

I gave up on building an index for bacterial genomes. I am now trying virus genomes, which are much smaller. However, the predictions are always the same, with probabilities that are very close to each other (within a 1e-5 range). I tried removing the predicted species and changing some parameters (k), but the problem persists. When I tried the print-word-vectors function, I noticed that the embeddings it produced were all very close to each other. Could that be related to this problem?
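
One way to quantify "very close" is to look at the pairwise cosine similarity of the printed vectors. The sketch below is a minimal example that assumes each output line is a label followed by whitespace-separated floats; that layout is an assumption, not something confirmed for this branch, and the helper names are hypothetical.

```python
import math


def parse_vectors(lines):
    """Parse 'label v1 v2 ...' lines (assumed format) into {label: [floats]}."""
    vectors = {}
    for line in lines:
        parts = line.split()
        if len(parts) < 2:
            continue  # skip blank or malformed lines
        vectors[parts[0]] = [float(x) for x in parts[1:]]
    return vectors


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def min_pairwise_cosine(vectors):
    """Smallest cosine similarity over all pairs, or None with fewer than two vectors."""
    labels = sorted(vectors)
    sims = [
        cosine(vectors[a], vectors[b])
        for i, a in enumerate(labels)
        for b in labels[i + 1:]
    ]
    return min(sims) if sims else None
```

If even the smallest pairwise similarity is close to 1, the embeddings point in essentially the same direction, which would be consistent with near-identical prediction probabilities.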

> I will try to push something in the next couple of days.

Great, I have a good use for embeddings.

Thanks.