namoray / nineteen


Synthetic data generation #55

Open namoray opened 2 weeks ago

namoray commented 2 weeks ago

Instead of payloads being generated synthetically on the control node on a fixed schedule, they should be generated dynamically for each query. That way, every synthetic query can be unique.

Synthetic generation of payloads will need to be really quick, or we might not be able to keep up. Imagine there are 10 image-to-image requests per second you must send off. We naturally can't fetch a new image for each payload, or do any sort of heavy processing, or we won't be able to keep up with demand. So some components might need to be cached, but every query should still be unique.
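As a rough sketch of that trade-off (all names here are hypothetical, not from the repo): cache the expensive components, such as a small pool of base images, and randomize the cheap ones, such as the prompt and seed, per query.

```python
import random
import uuid

# Hypothetical sketch: cache the heavy components (base images) once at
# startup, and vary the cheap ones (prompt, seed) per query so every
# payload is unique without any per-query heavy processing.
CACHED_BASE_IMAGES: list[bytes] = []  # filled once at startup, e.g. ~100 images

PROMPT_FRAGMENTS = ["a photo of", "an oil painting of", "a sketch of"]
SUBJECTS = ["a mountain lake", "a city at night", "a forest in autumn"]


def build_image_to_image_payload() -> dict:
    """Build a unique synthetic image-to-image payload in microseconds,
    reusing a cached base image instead of fetching a new one."""
    return {
        "init_image": random.choice(CACHED_BASE_IMAGES),
        "prompt": f"{random.choice(PROMPT_FRAGMENTS)} {random.choice(SUBJECTS)}",
        "seed": random.randint(0, 2**32 - 1),
        "request_id": str(uuid.uuid4()),  # guarantees uniqueness even on collisions
    }
```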

So, instead of pulling a synthetic query payload from Redis, we could, for example, generate the payload on the query node directly. Something to think about there: what if the dataset needed is a few hundred MB? Will this cause issues?
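On the dataset-size question, one option (sketched here with hypothetical names and paths) is to load the dataset onto each query node once and keep it cached in the process, so the few-hundred-MB cost is paid once at startup rather than per query:

```python
import functools
from pathlib import Path

DATASET_PATH = Path("/cache/synthetic_dataset.bin")  # hypothetical location


@functools.lru_cache(maxsize=1)
def load_dataset() -> bytes:
    """Read the dataset from local disk once per process; subsequent calls
    hit the in-memory cache, so a few hundred MB is a one-time startup cost,
    not a per-query one."""
    return DATASET_PATH.read_bytes()
```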

See: https://sn19.ai/pdfs/sn19-deck.pdf for info on the mechanism

The docs should guide you very well. For testing, you can probably get away with doing this without creating a wallet, since you don't need to interact with miners. You might need to comment various bits of the flow out to get it to work in that case, but it's probably more efficient.

tripathiarpan20 commented 5 days ago

Adding a bit more context:

The control node runs continuously_fetch_synthetic_data_for_tasks to refresh the synthetic data payload for each task in a separate Redis key. Meanwhile, on the query node, message.query_payload = await putils.get_synthetic_payload(config.redis_db, task) fetches the synthetic payload from the associated task key in Redis. The same Redis instance is shared between the control node and the query node over the Docker network.
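Roughly, the current fetch path looks like this (a simplified sketch; the key scheme and serialization are assumptions, not read from the code):

```python
import json
from redis.asyncio import Redis


async def get_synthetic_payload(redis_db: Redis, task: str) -> dict:
    """Fetch the latest payload that the control node's
    continuously_fetch_synthetic_data_for_tasks loop wrote for this task."""
    raw = await redis_db.get(f"synthetic_data:{task}")  # assumed key scheme
    if raw is None:
        raise RuntimeError(f"no synthetic data cached for task {task!r}")
    return json.loads(raw)
```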

Although the above guarantees that synthetic data is readily fetchable by the query node (since the control node periodically refreshes the synthetic data keys in Redis), it slows down Redis DB operations.

The main idea is to generate the synthetic data directly on the query node and dispose of the continuously_fetch_synthetic_data_for_tasks event loop in the control node. The synthetic data generation needs to be quick, though: in production we serve at least 6-8 synthetic requests per second (this is the total over all tasks).
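Concretely, the query node would call a fast local generator in place of the Redis fetch. A minimal sketch, with a hypothetical per-task dispatch table (if generation is serial, 6-8 requests per second total leaves roughly 125 ms per payload, an assumed budget to tune against real throughput):

```python
import random
import time

# Hypothetical per-task dispatch table; each builder must stay fast enough
# for the query node to sustain 6-8 synthetic requests per second in total.
PAYLOAD_BUILDERS = {
    "chat": lambda: {"prompt": f"question #{random.randint(0, 10**9)}"},
    # "image-to-image": build_image_to_image_payload,  # see the sketch above
}


def generate_synthetic_payload(task: str) -> dict:
    """Drop-in replacement for the Redis fetch: build a unique payload locally."""
    start = time.perf_counter()
    payload = PAYLOAD_BUILDERS[task]()
    elapsed_ms = (time.perf_counter() - start) * 1000
    if elapsed_ms > 125:  # assumed per-payload budget, not a measured figure
        print(f"warning: payload generation took {elapsed_ms:.1f} ms")
    return payload
```

The query node line would then become message.query_payload = generate_synthetic_payload(task), with no Redis round-trip and no stale data.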