neume 2.1 ? - Githubissues

il3ven commented 1 year ago

Our current roadmap for neume is to support decent and lens, and to make the crawler more generic. Below are a few technical changes I propose for this roadmap.

Save Tracks instead of NFTs

Our schema currently represents an NFT. However, multiple NFTs can represent the same song (track). This leads to duplication of data, and the consumer of neume has to merge NFTs into tracks.
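
As a rough illustration of the merge step a consumer has to do today, the sketch below groups NFT records into tracks. The record shape (`trackKey`, `title`, `owner`) and the function name are hypothetical and are not neume's actual schema.

```javascript
// Hypothetical NFT records; the field names are assumptions, not neume's schema.
const nfts = [
  { trackKey: "song-a", title: "Song A", owner: "0x123" },
  { trackKey: "song-a", title: "Song A", owner: "0xabc" },
  { trackKey: "song-b", title: "Song B", owner: "0x456" },
];

// Group NFTs that point to the same track and collect their owners.
function mergeNftsIntoTracks(records) {
  const tracks = new Map();
  for (const { trackKey, title, owner } of records) {
    if (!tracks.has(trackKey)) {
      tracks.set(trackKey, { trackKey, title, owners: [] });
    }
    const track = tracks.get(trackKey);
    if (!track.owners.includes(owner)) track.owners.push(owner);
  }
  return [...tracks.values()];
}

const mergedTracks = mergeNftsIntoTracks(nfts);
```

Storing tracks directly would move this grouping into the crawler, so every consumer would not have to repeat it.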

We stuck with NFTs because it was simpler and levelDB isn't suitable for tracks.

Pros of moving to Tracks

It will make the crawler more generic because not every protocol publishes audio as NFTs, for example lens.
We will save space since multiple NFTs can point to the same track.

Problem with saving tracks in levelDB

LevelDB is a key-value database. Imagine we have the following track in our database. owners is the list of owners for this track.

{
  ...
  "owners": [],
  ...
}

If two threads update the owners field simultaneously, each one has to read the whole track and write the whole value back.

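// Thread 1
// Read the whole track, append an owner, write the whole track back
const oldTrack = getTrack(id)
const newTrack = { ...oldTrack, owners: [...oldTrack.owners, '0x123'] }
updateTrack(newTrack)

// Thread 2
// Same steps, possibly starting from the same stale copy of the track
const oldTrack = getTrack(id)
const newTrack = { ...oldTrack, owners: [...oldTrack.owners, '0xabc'] }
updateTrack(newTrack)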

Let's suppose both threads read the track before either one writes, and thread 2 finishes last. Thread 1's update is lost and we have the following value in our database.

{
  ...
  "owners": ["0xabc"],
  ...
}

Databases like MongoDB allow inserting values into a nested field, but unfortunately levelDB doesn't. We can write code to add this functionality on top of levelDB, but it won't be flexible. If we have another field like owners in the future, we will have to write more code. Not ideal.
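
To make the lost update concrete, here is a minimal sketch that reproduces it with two concurrent async tasks and then avoids it with a per-key queue. The `Map` is only a stand-in for a key-value store, and every name in it (`get`, `put`, `addOwnerUnsafe`, `addOwnerSerialized`) is hypothetical rather than part of neume or levelDB.

```javascript
// In-memory stand-in for a key-value store with async get/put (hypothetical).
const store = new Map();
const get = async (key) => JSON.parse(store.get(key));
const put = async (key, value) => { store.set(key, JSON.stringify(value)); };

// Naive read-modify-write: concurrent callers can overwrite each other.
async function addOwnerUnsafe(id, owner) {
  const track = await get(id);
  await put(id, { ...track, owners: [...track.owners, owner] });
}

// Per-key promise chain: updates to the same key run one after another.
const queues = new Map();
function addOwnerSerialized(id, owner) {
  const previous = queues.get(id) ?? Promise.resolve();
  const next = previous.then(() => addOwnerUnsafe(id, owner));
  queues.set(id, next.catch(() => {})); // keep the chain usable after a failure
  return next;
}

async function demo() {
  await put("track-1", { owners: [] });
  await Promise.all([
    addOwnerUnsafe("track-1", "0x123"),
    addOwnerUnsafe("track-1", "0xabc"),
  ]);
  const unsafeOwners = (await get("track-1")).owners;

  await put("track-1", { owners: [] });
  await Promise.all([
    addOwnerSerialized("track-1", "0x123"),
    addOwnerSerialized("track-1", "0xabc"),
  ]);
  const serializedOwners = (await get("track-1")).owners;

  return { unsafeOwners, serializedOwners };
}
```

The queue only protects writers inside one process, and this kind of helper would have to be repeated for every field that needs it, which is the inflexibility described above.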

Using sqlite to solve the above LevelDB problem

I propose giving sqlite a try. To save effort we can use an ORM such as Sequelize.

We dismissed sqlite before because it was pointed out that it has slow write speed. I would argue that speed isn't our top priority, and besides, how slow can sqlite really be?

Make strategies more generic

To be written...

neatonk commented 1 year ago

It sounds like the fundamental issue you're describing is a race condition between threads. Both threads update the same value and the last write wins, which is not what you want in this case. Instead you would like the result to include both values.

You have proposed sqlite as a solution. Would the idea be to model the owners array as a many-to-one relationship in which each track has many owners? If so, then this is solving the issue by changing the data model.

I'd like to share some alternatives, but have run out of time. Looking forward to more discussion on this later.

il3ven commented 1 year ago

> It sounds like the fundamental issue you're describing is a race condition between threads. Both threads update the same value and the last write wins, which is not what you want in this case. Instead you would like the result to include both values.

Exactly.

> You have proposed sqlite as a solution. Would the idea be to model the owners array as a many-to-one relationship in which each track has many owners? If so, then this is solving the issue by changing the data model.

Yes, I do plan to implement a many-to-one relationship.

> I'd like to share some alternatives, but have run out of time. Looking forward to more discussion on this later.

Alternatives are most welcome.

neatonk commented 1 year ago

> Alternatives are most welcome.

Nice. Thanks!

The most apparent alternative would be to stick with levelDB, but change the data model to avoid the race condition. In this case that would mean creating a new key for tracking owners. Something like some-key-referring-to-an-nft/owners/0xabc, where the first part can be the key you are currently using and the last part is the owner address. The value at that key could be blank or include details like chain id and block number, which are likely represented in the key already. There may be trade-offs affecting usability on read that would need to be considered in more detail.
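
A minimal sketch of this key-per-owner layout, assuming an in-memory `Map` as a stand-in for levelDB. The helper names (`ownerKey`, `addOwner`, `listOwners`) and the stored details are hypothetical.

```javascript
// In-memory stand-in for a key-value store such as levelDB (hypothetical).
const db = new Map();

// Build a key like "some-key-referring-to-an-nft/owners/0xabc".
const ownerKey = (nftKey, owner) => `${nftKey}/owners/${owner}`;

// Each owner gets its own key, so two writers never update the same value.
function addOwner(nftKey, owner, details = {}) {
  db.set(ownerKey(nftKey, owner), details);
}

// On read, collect every key under the "<nftKey>/owners/" prefix.
function listOwners(nftKey) {
  const prefix = `${nftKey}/owners/`;
  return [...db.keys()]
    .filter((key) => key.startsWith(prefix))
    .map((key) => key.slice(prefix.length))
    .sort();
}

addOwner("some-key-referring-to-an-nft", "0x123");
addOwner("some-key-referring-to-an-nft", "0xabc", { blockNumber: 1 });
const owners = listOwners("some-key-referring-to-an-nft");
```

In an ordered store the filter step would presumably become a range read over the prefix instead of a scan of all keys. The read-side cost mentioned above shows up in `listOwners`: the owners list has to be reassembled on every read.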

Also, is the key structure documented somewhere? Just now realizing I am not entirely up to speed on that.

neatonk commented 1 year ago

Another option would be to consider the use of a CRDT with the desired semantics. I am not familiar enough with the use of CRDTs to suggest how it would apply in this case. https://crdt.tech/implementations
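
For a sense of what those semantics could look like, here is a minimal sketch of a grow-only set (G-Set), one of the simplest CRDTs. It is an illustration only, not a suggestion of which CRDT neume should use, and the `GSet` class is written from scratch here rather than taken from a library.

```javascript
// Grow-only set (G-Set): replicas only add elements and merging is a set
// union, so the result does not depend on the order of the merges.
class GSet {
  constructor(items = []) {
    this.items = new Set(items);
  }
  add(item) {
    this.items.add(item);
    return this;
  }
  merge(other) {
    return new GSet([...this.items, ...other.items]);
  }
  values() {
    return [...this.items].sort();
  }
}

// Two writers start from the same empty owners set and add independently.
const replicaA = new GSet().add("0x123");
const replicaB = new GSet().add("0xabc");

// Merging in either order keeps both owners.
const mergedAB = replicaA.merge(replicaB).values();
const mergedBA = replicaB.merge(replicaA).values();
```

A G-Set cannot express removals, so if an owner has to be dropped after a transfer, a CRDT that supports removal would be needed instead.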

il3ven commented 1 year ago

> Also, is the key structure documented somewhere? Just now realizing I am not entirely up to speed on that.

It isn't documented but you can find it here. https://github.com/neume-network/crawler/blob/7a2a215e8b6c8f7fbc179734b0b098b8e8ac9b27/database/index.ts#L30

> Something like some-key-referring-to-an-nft/owners/0xabc, where the first part can be the key you are currently using and the last part is the owner address. The value at that key could be blank or include details like chain id and block number, which are likely represented in the key already.

Yes, I have also thought about this and it is a valid solution. However, we will have to write code to merge the owners on read. We can do that for now, but if the schema changes in the future we will have to rewrite it. Also, if we introduce new many-to-many or one-to-many relationships, we will have to write more custom code.

I will have a look at CRDTs too, but if sqlite doesn't impact our performance then we should use it instead of implementing everything ourselves. I believe the network calls will be the bottleneck while crawling, not our DB.

reimertz commented 1 year ago

To add some spice to this conversation - I have been thinking about the possibility of going down the route of piggy-backing on the progress of Strapi and making neume a fork of their project (that we keep up to date by merging new releases / bug fixes).

This idea intrigues me because we'd get a lot of functionality for free:

Database adapters (SQLite, Postgres, MongoDB, MySQL, MariaDB)
CMS
Schema creation / validation
Import / Export / Syncing
GraphQL / REST APIs
Cron Job support (current crawl command could prob be re-written to a cron-task)
Deployment
Hosted Deployment
CLI (could rename / add neume crawl / neume daemon etc)

But there are some concerns / blockers with this approach:

Strapi is not ACID-compliant (relates to the issues pointed out above)
Updates are very costly - for us, that means the owners and transactions. But I do think that if we went down the route of utilizing their one-to-many / many-to-many relationships, we could limit writes and therefore get better write performance.

Would love your input / ideas regarding this @il3ven @neatonk

(This is probably more of a neume 3.0 discussion, but intrigued to hear what you think)

neatonk commented 1 year ago

> To add some spice to this conversation - I have been thinking about the possibility of going down the route of piggy-backing on the progress of Strapi and making neume a fork of their project (that we keep up to date by merging new releases / bug fixes).

Interesting idea and spicy, as advertised.

I'd argue against this for neume mostly because I think it would be detrimental to other use cases of neume that wouldn't need any of that. That said, I think it would be reasonable to structure neume as a library that can be embedded into other apps with minimal friction. Could be a good thought exercise to ask what would need to change about neume for that to be feasible.

neume-network / crawler

neume 2.1 ? #7
