Definitely a good idea. I have 7 TB zipped or 70 TB unzipped geojson data to sell.
Note: this idea / issue was initially described in Discord. Here's the link. https://discord.com/channels/612953348487905282/612953349003673629/809887981291700307
It also outlined one potential approach:
Splitting up the dataset into many smaller ones doesn't solve the problem, since X + gas < n * (X / n + gas) = X + n * gas, i.e. every extra slice adds another gas cost.
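A quick numeric sketch of that inequality (X, gas and n are made-up numbers, not real Ocean fees):

```python
# Compare publishing one whole dataset vs. splitting it into n slices.
# X   = cost that can be split across slices (e.g. initial liquidity)
# gas = fixed per-publish cost that cannot be split
# All numbers are illustrative only.
X, gas, n = 100.0, 5.0, 53

whole = X + gas                 # one publish: 105.0
sliced = n * (X / n + gas)      # n publishes: X + n * gas = 365.0
print(whole, sliced)            # slicing multiplies the gas term by n
```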
There seem to be two separate concerns:

1. How to handle a super-large dataset with many files.
2. How to meet the high initial liquidity needs for a valuable (high-priced) dataset.

The concerns are separate. For example, small but highly valuable datasets might have concern (2) but not (1).
Aside: the title is currently "Scalability upgrade for large scale datasets: Sliced Dataset". As with any GitHub issue, it's better if the title refers to the problem/concern rather than a proposed solution. Could you rename it please? E.g. "Unclear how to handle super-large dataset with many files + high initial liquidity needs for valuable dataset"
Let's discuss each.
Datapoint: related GitHub issue "Consumption / Download of Large Files (over 400MB - 500MB) Fails" multiRepo#55. It's been fixed.
The issue above is only partly related, since this issue is also about having many files.
The workaround described in the issue was to slice the dataset into 53 slices and do a weekly rotation on the download side.
Datapoint: the Ocean backend already supports >1 file for a given data asset, as a list in the DDO. Ocean.js and ocean.py support this. However, it's not currently easy to use this feature in the Ocean Market GUI.
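For illustration, here is a minimal sketch of what a multi-file metadata / files list could look like (the exact schema and publish call depend on the ocean.py / Aquarius version in use, so treat the field names as assumptions):

```python
# Sketch only: illustrates that the DDO metadata carries a *list* of files,
# one entry per file/slice. Field names follow a V3-style metadata layout
# and may differ in your ocean.py / Aquarius version.
num_slices = 53
metadata = {
    "main": {
        "type": "dataset",
        "name": "My large image dataset",
        "author": "Publisher Name",
        "license": "CC-BY",
        "dateCreated": "2021-02-12T00:00:00Z",
        "files": [
            {
                "index": i,
                "contentType": "application/zip",
                "url": f"http://foobar.com/datasets/mydataset_{i + 1}.zip",
            }
            for i in range(num_slices)
        ],
    }
}
```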
Possible solutions, could be done now:
Possible solutions, would need changes:
Concern: with AMM-based pricing, if a large initial price is wanted then a very large amount of initial liquidity is required, which is "beyond most publishers' possibilities".
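To make that concrete, here is a rough sketch assuming a generic Balancer-style weighted pool (the weights and balances are illustrative assumptions, not the actual Ocean Market pool parameters):

```python
# Rough sketch: OCEAN liquidity needed to hit a target datatoken spot price
# in a Balancer-style weighted pool. Parameters are illustrative only.
def required_ocean(target_price, dt_balance, w_ocean=0.5, w_dt=0.5):
    # Balancer spot price = (B_ocean / w_ocean) / (B_dt / w_dt);
    # solve for B_ocean given the target spot price.
    return target_price * dt_balance * w_ocean / w_dt

# e.g. a 100 OCEAN-per-datatoken price with 1,000 datatokens in the pool:
print(required_ocean(target_price=100, dt_balance=1_000))  # 100000.0 OCEAN
```

So a higher target price or a deeper pool directly multiplies the OCEAN a publisher must lock up front.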
Possible solutions, could be done now:
Possible solutions, in Ocean roadmap:
@tmanthey It would be great to understand how well the possible solutions fit your needs.
Closing since there were no responses to the questions, there are some ways to handle the concerns now, and other roadmap plans / GitHub issues handle them even better going forward.
Is your feature request related to a problem? Please describe.
Ocean Market does not easily support the publishing of large scale datasets. AI image datasets can easily contain millions of images, which results in datasets of hundreds of GB or even TB. The amount of initial liquidity required to create a pool would be just beyond most publishers' possibilities.
Describe the solution you'd like
For our dataset we worked around the issue by slicing it into 53 slices and doing a weekly rotation on the download side. Nevertheless, this approach is not sustainable, as potential customers would have to wait a year to consume the full dataset. Therefore I propose the option to publish "sliced datasets". This requires that the publisher can configure the number of slices for their dataset during the publishing process. The slices must be numbered. The download URL must contain a placeholder for the sliceId.
http://foobar.com/datasets/mydataset_%.zip
where % is a placeholder for 1...#slices
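As an illustration, a minimal sketch of expanding such a template into per-slice download URLs (the helper function is hypothetical, not an existing Ocean API):

```python
# Hypothetical helper: expand the proposed sliced-dataset URL template,
# where '%' stands for the slice id 1..num_slices.
def slice_urls(template: str, num_slices: int) -> list[str]:
    return [template.replace("%", str(i)) for i in range(1, num_slices + 1)]

urls = slice_urls("http://foobar.com/datasets/mydataset_%.zip", 53)
print(urls[0])   # http://foobar.com/datasets/mydataset_1.zip
print(urls[-1])  # http://foobar.com/datasets/mydataset_53.zip
```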
This would reduce the initial liquidity requirement to that of a single slice and would encourage the consumption of datasets.
https://market.oceanprotocol.com/asset/did:op:2a76F680279CE629a9F5E601BDa7246e06F226f0
Describe alternatives you've considered
Very cool would be the ability to download a percentage of the total dataset, or an amount matching the coins in the wallet (similar to the slider in the Binance market that lets you spend a certain percentage of your coins on a particular transaction). But that would require tracking the dataset slices that the user has already downloaded, so that they do not download the same slices on the next purchase.
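A minimal sketch of that bookkeeping, assuming slices are tracked per consumer off-chain (all names here are hypothetical):

```python
# Hypothetical bookkeeping: remember which slices each consumer already has,
# and hand out the next batch for a "download X% of the dataset" purchase.
downloaded: dict[str, set[int]] = {}  # consumer address -> slice ids delivered

def next_slices(consumer: str, num_slices: int, fraction: float) -> list[int]:
    have = downloaded.setdefault(consumer, set())
    want = round(num_slices * fraction)
    batch = [i for i in range(1, num_slices + 1) if i not in have][:want]
    have.update(batch)
    return batch

print(next_slices("0xConsumerAddress", 53, 0.10))  # first ~5 slices
print(next_slices("0xConsumerAddress", 53, 0.10))  # next ~5, no repeats
```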