morphware / service


Make ShareTestingData get called automatically after ShareTrainedModel #78

Open darshanraju opened 2 years ago

darshanraju commented 2 years ago

In our efforts to remove all responsibility for calling functions from the Data Scientist, we should investigate whether it's possible to have shareTestingDataset be called by the JobFactory contract at the end of TrainedModelShared.

shareTestingDataset is currently used to share the MagnetURI of the testing data with the validator. So we will have to think about how the testingData magnet can still be provided without having the data scientist call anything. Store the magnetLink in IPFS? Or in the contract, when the job description is initially posted?

darshanraju commented 2 years ago

@Trabing thoughts?

Trabing commented 2 years ago

The only reason sharing the training dataset and sharing the testing dataset are separate stages is to make it impossible for the worker-node to cheat, i.e. to subject the outcome of the training process to something akin to look-ahead bias.

Trabing commented 2 years ago

The testing dataset is seeded at the same time the Jupyter notebook and training dataset are, but if its magnet link is posted to the smart contract before the training process is finished, there's the possibility of cheating.

Maybe we can leverage a network of seeders, so the data scientist can go offline, but I'm not sure how we would make it so that an unscrupulous worker-node operator couldn't figure that information out, programmatically or not.

darshanraju commented 2 years ago

"The testing dataset is seeded at the same time the Jupyter notebook and training dataset are"

I don't think so. The model and training dataset are shared initially, when calling shareUntrainedModelAndTrainingDataset() https://github.com/morphware/service/blob/e6d88ea36463f069313851ebc7278caa21ec5958/contracts/JobFactory.sol#L147

The testing dataset is shared later on, when calling shareTestingDataset() https://github.com/morphware/service/blob/4cbf23da52415bdec078f69deb69862e5ffcfe38/contracts/JobFactory.sol#L220

Maybe we could in fact share the testing dataset initially and encrypt it within the contract with a private secret. Then, when a worker has trained a model, we decrypt it and send the testingMagnet to a validator as an event. That way the testing magnet is held within the contract the whole time, but it is only known AFTER the worker has trained the model.
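A rough way to picture that flow is the off-chain Python model below. It is only a sketch of the ordering, not JobFactory code: `JobSketch`, `sealed_testing_magnet`, and the `TestingDatasetShared` event name are hypothetical, and the sealing itself is assumed rather than implemented. That assumption matters, because contract storage on a public chain can be read by anyone, so the secret would have to live somewhere a worker-node cannot see it.

```python
from dataclasses import dataclass, field
from enum import Enum, auto


class Stage(Enum):
    POSTED = auto()
    TRAINED_MODEL_SHARED = auto()


@dataclass
class JobSketch:
    """Hypothetical off-chain model of one job; not the JobFactory contract."""

    training_magnet: str
    # ASSUMPTION: this value is unreadable by worker-nodes until released.
    sealed_testing_magnet: str
    stage: Stage = Stage.POSTED
    events: list = field(default_factory=list)

    def share_trained_model(self, trained_model_magnet: str) -> None:
        """Record the trained model, then release the testing magnet."""
        if self.stage is not Stage.POSTED:
            raise RuntimeError("trained model already shared for this job")
        self.stage = Stage.TRAINED_MODEL_SHARED
        self.events.append(("TrainedModelShared", trained_model_magnet))
        # Released in the same call, so the data scientist calls nothing.
        # "TestingDatasetShared" is a made-up event name for illustration.
        self.events.append(("TestingDatasetShared", self.sealed_testing_magnet))
```

The point of the sketch is that the testing magnet is supplied once, at posting time, and only appears in an event after the worker has shared the trained model.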

Trabing commented 2 years ago

Not to focus too much on semantics, here, but I'm talking about seeding instead of writing to the smart contract

Trabing commented 2 years ago

I like the idea of introducing an intermediate step, though