karlicoss / promnesia

Another piece of your extended mind
https://beepb00p.xyz/promnesia.html

Some Indexers need an overwrite_db or last_indexed_time parameter. #279

hwiorn commented 2 years ago

I have made a Joplin indexer. But there is a problem: the indexer needs an incremental-update parameter when the database is large. I have 8000+ notes in my Joplin database, and the Joplin indexer finds 24000+ URLs which can become Visits. Indexing takes 17 minutes on my laptop.

Joplin has an update_time field in the notes table, so I think I can implement incremental indexing (updating) in the indexer.

However, there is no overwrite_db parameter in the Indexer for when a user passes the --overwrite parameter and wants to restart indexing from scratch. Alternatively, if the promnesia framework passed a last_indexed_time value into iter_all_visits, that would be even more helpful.
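As a sketch of the idea: assuming Joplin's notes table has an updated_time column in epoch milliseconds, and assuming the framework passed through the hypothetical last_indexed_time parameter proposed above, the indexer could skip unchanged notes like this:

```python
import sqlite3
from pathlib import Path
from typing import Iterator, Tuple

def iter_updated_notes(db: Path, last_indexed_time_ms: int) -> Iterator[Tuple[str, str]]:
    # last_indexed_time_ms is the hypothetical parameter proposed above;
    # Joplin keeps note timestamps as milliseconds since the epoch
    with sqlite3.connect(f'file:{db}?mode=ro', uri=True) as conn:
        for note_id, body in conn.execute(
            'SELECT id, body FROM notes WHERE updated_time > ?',
            (last_indexed_time_ms,),
        ):
            yield note_id, body
```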

karlicoss commented 2 years ago

Hi, sorry for the late response!

It's actually surprising that it takes 17 minutes for 8K notes/24K URLs -- do you know how many lines these are? Unless your laptop is really weak, I would expect it to index much faster. Maybe you can log indexing times for individual notes, figure out the one that takes longest and then we can profile it?
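For example, something along these lines would surface the slowest notes (extract_urls_from_note is just a placeholder for whatever your indexer does per note):

```python
import time

def index_with_timing(notes, extract_urls_from_note):
    # extract_urls_from_note stands in for the indexer's per-note logic
    timings = []
    for note in notes:
        start = time.perf_counter()
        urls = list(extract_urls_from_note(note))
        timings.append((time.perf_counter() - start, note.id, len(urls)))
    # report the ten slowest notes as profiling candidates
    for elapsed, note_id, n_urls in sorted(timings, key=lambda t: t[0], reverse=True)[:10]:
        print(f'{elapsed:.3f}s  note={note_id}  urls={n_urls}')
```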

Otherwise, regarding your suggestion of incremental indexing based on update_time:

It kinda makes sense, but one downside is that it's possible that some URLs were removed from the note, and they would still be present in the promnesia database, because the 'interface' of indexers in Promnesia currently only supports adding new visits. So it would trigger some phantom visits. We might think of changing the interface somehow, but I'd much rather speed up the indexer for simplicity.

hwiorn commented 2 years ago

My laptop is a Dell Inspiron 7501 (i7-10750H CPU @ 2.60GHz, 16GB RAM). I don't think this laptop is a slow environment, but some machines, such as an RPi or a single-core AWS Lightsail instance, could be slow.

It's actually surprising that it takes 17 minutes for 8K notes/24K URLs -- do you know how many lines these are? Unless your laptop is really weak, I would expect it to index much faster.

Many notes came from Evernote. I used Joplin as an archiving tool and for keeping a journal at work. Some notes are web clips, and those seem to contain many useless links. Recently I have been switching from Joplin to org-roam while learning the Zettelkasten method, so I use Joplin as a wayback machine now.

Maybe you can log indexing times for individual notes, figure out the one that takes longest and then we can profile it?

The Joplin indexer was a proof of concept, and it's just an initial version, so I think I can profile the indexing.

It kinda makes sense, but one downside is that it's possible that some URLs were removed from the note, and they would still be present in the promnesia database, because the 'interface' of indexers in Promnesia currently only supports adding new visits. So it would trigger some phantom visits.

Right. Incremental and partial updates need at least two pieces of metadata: the last sync time, and an ID mapping between source and target.
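Roughly something like this (the names are illustrative, not an existing promnesia structure):

```python
from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class SyncState:
    # epoch milliseconds of the last successful indexing run
    last_sync_time_ms: int = 0
    # source note id -> visit ids derived from it, so stale
    # visits could be removed when a note changes or disappears
    note_to_visits: Dict[str, List[str]] = field(default_factory=dict)
```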

We might think of changing the interface somehow, but I'd much rather speed up the indexer for simplicity.

Yeah, you are right, I can optimize the indexer further. But I think Promnesia needs incremental update support for slow machines and for efficient indexing.

karlicoss commented 2 years ago

I don't think this laptop is a slow environment

Yep, looks decent, surprising it takes so much time!

But I think Promnesia needs incremental update support for slow machines and for efficient indexing.

Yep, definitely agree it makes sense to make it as fast as we can :) I just mean there is a tradeoff between that and the simplicity of the architecture.

Right. Incremental and partial updates need at least two pieces of metadata: the last sync time, and an ID mapping between source and target.

Yeah -- the problem is the latter: basically there is currently no way to tell, for a visit in the database, which file it came from. To be more precise, no reliable way: there is a Locator thing, but it's not guaranteed to be the exact filename.
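For context, this is roughly what an indexer emits (a sketch using promnesia's Visit and Loc types; check the current API for the exact fields):

```python
from datetime import datetime, timezone
from promnesia.common import Visit, Loc

def visit_for(url: str, note_path: str, updated_ms: int) -> Visit:
    return Visit(
        url=url,
        dt=datetime.fromtimestamp(updated_ms / 1000, tz=timezone.utc),
        # the locator only helps the user jump back to the source;
        # it's not guaranteed to be the exact filename, so it can't
        # serve as a reliable key for selectively deleting visits
        locator=Loc.file(note_path),
    )
```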

Maybe a good compromise would be adding cachew support for file-based indexers, so basically each file would have a cache of its Visits (depending on the file timestamp), and it would automatically recompute if necessary. That would allow keeping promnesia itself simple and not worry about selectively removing stuff from the database.
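Roughly the idea (a sketch; cachew's depends_on can key the cache off the file mtime, and extract_visits is a hypothetical per-file extractor):

```python
from pathlib import Path
from typing import Iterator

from cachew import cachew
from promnesia.common import Visit

@cachew(depends_on=lambda path: path.stat().st_mtime)
def visits_for_file(path: Path) -> Iterator[Visit]:
    # recomputed only when the file's mtime changes;
    # otherwise visits are served from the per-file cache
    yield from extract_visits(path)  # hypothetical per-file extractor
```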

hwiorn commented 2 years ago

Maybe a good compromise would be adding cachew support for file-based indexers, so basically each file would have a cache of its Visits (depending on the file timestamp), and it would automatically recompute if necessary. That would allow keeping promnesia itself simple and not worry about selectively removing stuff from the database.

I had already seen cachew, and I thought it wasn't the right solution for caching here. I guess I didn't look closely enough. Let me add cachew to the indexer.

hwiorn commented 2 years ago

Related: #243