Open darshanraju opened 2 years ago
@Trabing thoughts?
The only reason the training dataset and the testing dataset are shared in separate stages is so that the worker-node can't cheat, i.e. so that the outcome of the training process isn't subject to something akin to look-ahead bias.
The testing dataset is seeded at the same time as the Jupyter notebook and training dataset, but if its magnet link is posted to the smart contract before the training process is finished, there's the possibility of cheating.
Maybe we can leverage a network of seeders, so the data scientist can go offline, but I'm not sure how we would stop an unscrupulous worker-node operator from figuring that information out, programmatically or not.
"The testing dataset is seeded at the same time the Jupyter notebook and training dataset are"
I don't think so. The model and training dataset are shared initially, when calling shareUntrainedModelAndTrainingDataset() https://github.com/morphware/service/blob/e6d88ea36463f069313851ebc7278caa21ec5958/contracts/JobFactory.sol#L147
The testing dataset is shared later on, when calling shareTestingDataset() https://github.com/morphware/service/blob/4cbf23da52415bdec078f69deb69862e5ffcfe38/contracts/JobFactory.sol#L220
Maybe we could in fact share the testing dataset initially and encrypt it within the contract with a private secret. Then, when a worker has trained a model, we decrypt it and send the testingMagnet to a validator as an event. That way the testing magnet is held within the contract the whole time, but it is only known AFTER the worker has trained the model.
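To make that idea concrete, here is a minimal off-chain sketch in Python, not Solidity and not the actual JobFactory code. `SealedJob`, `keystream_xor`, and the `TestingDatasetShared` event name are hypothetical, and the XOR keystream is a toy cipher used only so the example runs with the standard library. It also assumes the secret is held outside the job record until training is done, since anything written to a public contract's storage can be read by anyone.

```python
import hashlib
import secrets


def keystream_xor(key: bytes, data: bytes) -> bytes:
    """Toy cipher: XOR data with a SHA-256 counter keystream (illustration only)."""
    out = bytearray()
    for offset in range(0, len(data), 32):
        counter = (offset // 32).to_bytes(8, "big")
        block = hashlib.sha256(key + counter).digest()
        chunk = data[offset:offset + 32]
        out.extend(b ^ k for b, k in zip(chunk, block))
    return bytes(out)


class SealedJob:
    """Hypothetical stand-in for a job record; not the JobFactory contract."""

    def __init__(self, sealed_testing_magnet: bytes) -> None:
        # The sealed magnet is present from the moment the job is posted.
        self.sealed_testing_magnet = sealed_testing_magnet
        self.model_trained = False
        self.events = []

    def share_trained_model(self, trained_model_magnet: str) -> None:
        # Called by the worker once training has finished.
        self.model_trained = True
        self.events.append(("TrainedModelShared", trained_model_magnet))

    def reveal_testing_magnet(self, key: bytes) -> str:
        # The magnet is only unsealed after the trained model has been shared.
        if not self.model_trained:
            raise RuntimeError("testing magnet stays sealed until training is done")
        magnet = keystream_xor(key, self.sealed_testing_magnet).decode("utf-8")
        self.events.append(("TestingDatasetShared", magnet))
        return magnet


if __name__ == "__main__":
    key = secrets.token_bytes(32)
    magnet = "magnet:?xt=urn:btih:TESTING_PLACEHOLDER"  # placeholder link
    job = SealedJob(keystream_xor(key, magnet.encode("utf-8")))
    job.share_trained_model("magnet:?xt=urn:btih:MODEL_PLACEHOLDER")
    print(job.reveal_testing_magnet(key))
```

The open question from the thread remains who holds `key` and releases it; the sketch only shows the ordering guarantee, not a trustless way to store the secret.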
Not to focus too much on semantics here, but I'm talking about seeding, not writing to the smart contract
I like the idea of introducing an intermediate step, though
In our effort to take all responsibility for calling functions away from the Data Scientist, we should investigate whether it's possible to have shareTestingDataset be called by the JobFactory contract at the end of TrainedModelShared.
shareTestingDataset is currently used to share the MagnetURI of the testing data with the validator, so we will have to think about how it can still provide the testingData magnet without having the data scientist call anything. Store the magnet link in IPFS? Or in the contract, when the job description is initially posted?
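A rough Python simulation of that flow, as a sketch under stated assumptions rather than the real contract: `JobFactorySketch`, `post_job`, the `Stage` enum, and every event name other than TrainedModelShared are hypothetical. It assumes the data scientist supplies the testing magnet once, when the job is posted, and that the trained-model step itself triggers the testing-dataset share.

```python
from dataclasses import dataclass
from enum import Enum, auto


class Stage(Enum):
    TRAINING = auto()
    VALIDATING = auto()


@dataclass
class Job:
    training_magnet: str
    testing_magnet: str  # supplied once, when the job is posted
    stage: Stage = Stage.TRAINING


class JobFactorySketch:
    """Hypothetical off-chain model of the proposed flow."""

    def __init__(self) -> None:
        self.jobs = {}
        self.events = []

    def post_job(self, job_id: int, training_magnet: str, testing_magnet: str) -> None:
        # The only call the data scientist makes in this sketch.
        self.jobs[job_id] = Job(training_magnet, testing_magnet)
        self.events.append(
            ("UntrainedModelAndTrainingDatasetShared", job_id, training_magnet)
        )

    def share_trained_model(self, job_id: int, trained_model_magnet: str) -> None:
        # Called by the worker; it chains into the testing-dataset share.
        job = self.jobs[job_id]
        if job.stage is not Stage.TRAINING:
            raise RuntimeError("job is not in the training stage")
        self.events.append(("TrainedModelShared", job_id, trained_model_magnet))
        self._share_testing_dataset(job_id)

    def _share_testing_dataset(self, job_id: int) -> None:
        # Internal step: nobody outside the factory has to call it.
        job = self.jobs[job_id]
        job.stage = Stage.VALIDATING
        self.events.append(("TestingDatasetShared", job_id, job.testing_magnet))
```

In this sketch the testing magnet sits in the job record in plaintext, which on a public chain would reveal it to the worker early, so it would still need to be sealed or kept elsewhere (IPFS, as floated above) until the trained model is shared.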