oceanprotocol / market

🧜‍♀️ THE Data Market
https://market.oceanprotocol.com
Apache License 2.0

Scalability upgrade for large scale datasets: Sliced Dataset #379

Closed tmanthey closed 3 years ago

tmanthey commented 3 years ago

Is your feature request related to a problem? Please describe. Ocean Market does not easily support publishing large-scale datasets. AI image datasets can easily contain millions of images, which results in datasets of hundreds of GB or even TB. The amount of initial liquidity required to create a pool would be beyond most publishers' means.

Describe the solution you'd like For our dataset we worked around the issue by slicing it into 53 slices and rotating them weekly on the download side. Nevertheless, this approach is not sustainable, as potential customers would have to wait a year to consume the full dataset. Therefore I propose an option to publish "sliced datasets". This requires that the publisher can configure the number of slices for their dataset during the publishing process. The slices must be numbered, and the download URL must contain a placeholder for the slice id.

http://foobar.com/datasets/mydataset_%.zip

where % is a placeholder for 1...#slices
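To make the proposed placeholder scheme concrete, here is a minimal sketch that expands `%` into the numbered slice URLs (the template URL and slice count are the illustrative values from this issue, not an existing Ocean API):

```python
# Expand the proposed '%' placeholder into one download URL per slice.
def slice_urls(template: str, n_slices: int) -> list[str]:
    """Replace '%' with each slice id 1..n_slices."""
    return [template.replace("%", str(i)) for i in range(1, n_slices + 1)]

urls = slice_urls("http://foobar.com/datasets/mydataset_%.zip", 53)
print(urls[0])    # http://foobar.com/datasets/mydataset_1.zip
print(len(urls))  # 53
```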

This would reduce the requirement to provide initial liquidity to a single slice and would encourage the consumption of datasets.

https://market.oceanprotocol.com/asset/did:op:2a76F680279CE629a9F5E601BDa7246e06F226f0

Describe alternatives you've considered It would be very cool to download a dataset as a percentage of the total, or up to the coins in the wallet (similar to the slider in the Binance market that lets you spend a certain percentage of your coins on a particular transaction). But that would require tracking which slices a user has already downloaded, so that they do not download the same slices on the next purchase.

Koesters commented 3 years ago

Definitely a good idea. I have 7 TB zipped or 70 TB unzipped geojson data to sell.

trentmc commented 3 years ago

Note: this idea / issue was initially described in Discord. Here's the link. https://discord.com/channels/612953348487905282/612953349003673629/809887981291700307

It also outlined one potential approach:

Splitting up the dataset into many smaller ones doesn't solve the problem, as X + gas < n * (X / n + gas)
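The inequality can be checked with concrete numbers: each of the n purchases pays its own gas fee, so buying all slices costs (n - 1) extra gas fees compared to one purchase. A quick sketch with purely illustrative prices:

```python
# Splitting into n slices doesn't reduce total cost by itself:
# the buyer of all n slices pays n gas fees instead of one.
X = 1060.0   # total dataset price (illustrative; chosen so X/n is exact)
gas = 5.0    # fixed gas cost per purchase (illustrative)
n = 53       # number of slices, as in this issue

whole = X + gas              # one purchase of the whole dataset
sliced = n * (X / n + gas)   # n purchases of one slice each
print(whole, sliced)         # 1065.0 1325.0
```

The gap is exactly (n - 1) * gas; slicing helps the per-purchase liquidity requirement, not the buyer's total cost.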

trentmc commented 3 years ago

There seem to be two separate concerns:

  1. Unclear how to handle super-large dataset with many files
  2. High initial liquidity needs for a valuable dataset. In AMM-based pricing, if a large initial price is wanted then a very large amount of initial liquidity is required, which is "beyond most publishers' possibilities."

The concerns are separate. For example, small but highly valuable datasets might have concern (2).

Aside: the title is currently "Scalability upgrade for large scale datasets: Sliced Dataset". As with any github issue, it's better if the title refers to the problem/concern, versus a proposed solution. Could you rename it please? E.g. "Unclear how to handle super-large dataset with many files + high initial liquidity needs for valuable dataset"

Let's discuss each.

1. Unclear how to handle super-large dataset with many files

Datapoint: related github issue: "Consumption / Download of Large Files (over 400MB - 500MB) Fails" multiRepo#55. It's been fixed.

That issue above is only partly related, since this issue concerns having many files. The workaround described above was to slice the dataset into 53 slices and rotate them weekly on the download side.

Datapoint: the Ocean backend already supports >1 file for a given data asset, as a list in the DDO. Ocean.js and ocean.py support this. However, it's not currently easy to use this feature in the Ocean Market GUI.

Possible solutions, could be done now:

  1. 53 different data assets, one file each; scripts or GUI. Publish: with a script using ocean.js/py, or in the Ocean Market GUI. Discover: consumers use Ocean Market to see all 53 assets. Purchase: consumers buy each data asset separately with their own script, or in the Ocean Market GUI. Consume: consumers use their own script, or the Ocean Market GUI.
  2. One data asset, 53 different files; publish & consume with scripts. Publish: with a script using ocean.js/py. Discover: consumers use the Ocean Market GUI to see 1 asset. Purchase: consumers use the Ocean Market GUI. Consume: consumers use a script using ocean.js/py.
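For option 2, the key step is assembling the list of file entries that goes into the asset's DDO. A hypothetical sketch of that step (the field names below are illustrative, not the exact ocean.js/ocean.py schema):

```python
# Hypothetical sketch of the "one data asset, 53 files" route: build the
# files list that would be attached to the asset's DDO. Field names are
# illustrative placeholders, not the real DDO schema.
def build_files_list(template: str, n_slices: int) -> list[dict]:
    return [
        {"url": template.replace("%", str(i)), "contentType": "application/zip"}
        for i in range(1, n_slices + 1)
    ]

files = build_files_list("http://foobar.com/datasets/mydataset_%.zip", 53)
print(len(files))  # 53 entries, one per slice
```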

Possible solutions, would need changes:

  1. One data asset, 53 different files, everything in GUI. Publish: with Ocean Market GUI. Discover: consumers use Ocean Market to see 1 asset. Purchase: consumers use Ocean Market GUI. Consume: consumers use GUI. This needs changes to Ocean Market to better support >1 file in the GUI.

2. High initial liquidity needs for valuable dataset.

Concern: in AMM-based pricing, if a large initial price is wanted then a very large amount of initial liquidity is required, which is "beyond most publishers' possibilities."

Possible solutions, could be done now:

  1. Use fixed price. No liquidity is needed.
  2. Use the AMM. If you want an authentic price signal from the market, it needs the liquidity, simple as that. You can simply put in less liquidity with a lower initial price; if people like the dataset they'll stake more, and the price will go up.
  3. Use scripts for a different ratio. Use ocean.js/py to set a ratio other than 70-30 OCEAN-datatoken (h/t Robin L). However, be careful with this: the lower the % OCEAN, the less incentives are aligned between publisher and consumers.
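To illustrate why a different weight ratio (point 3) lowers the liquidity requirement, here is a sketch using the standard Balancer spot-price formula, which Ocean's AMM pools build on; the function name and numbers are illustrative, not an ocean.js/py API:

```python
# Balancer spot price: price = (ocean_balance / ocean_weight) / (dt_balance / dt_weight)
# Solving for the OCEAN needed to list dt_balance datatokens at a target price:
def ocean_needed(target_price: float, dt_balance: float,
                 ocean_weight: float, dt_weight: float) -> float:
    return target_price * dt_balance * ocean_weight / dt_weight

# Illustrative: 100 datatokens at 50 OCEAN each.
print(ocean_needed(50.0, 100.0, 0.7, 0.3))  # default 70-30 OCEAN-datatoken
print(ocean_needed(50.0, 100.0, 0.3, 0.7))  # 30-70 needs far less OCEAN up front
```

The trade-off noted above still applies: a lower OCEAN weight reduces the up-front OCEAN, but also weakens the incentive alignment between publisher and consumers.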

Possible solutions, in Ocean roadmap:

  1. Better Staking (Q4) decouples amount of stake from price. Therefore price won't shoot up in the face of a lot of staking. It also has datatoken vesting.

@tmanthey It would be great to understand how well the possible solutions fit your needs.

trentmc commented 3 years ago

Closing since there were no responses to the Q's; there are some ways to handle the concerns now, and other roadmap plans / github issues will handle them even better going forward.