"ipfs add" calculate CID and check for it *before* uploading to API server.

kallisti5 commented 4 years ago

ipfs add should calculate the CID and check for it's existence on the api server / network before copying the data over.

An example scenario:

I rsync 67GiB worth of files to my system
I ipfs --api=$IPFS_API_SERVER add -r /cache/* to make them available.

In the above scenario, my remote ipfs client seemingly needs to upload 67.88GiB of files to the API server every time I sync to IPFS. Ideally the local ipfs binary could quickly generate a CID and check for the CID's existance before transferring the data to the remote API server. Even in local deployments this could save time since IPFS unnecessarily transferring files.

ipfs

aschmahmann commented 4 years ago

@kallisti5 this is mostly a function of the "client" being very thin and would likely be a bit of a pain (although not infeasible) to implement without breaking changes. A few notes:

1) In order to add a file/directory to IPFS it needs to go through the the process of being converted into IPLD blocks (i.e. chunking + UnixFSifying the data).

This means that in order to calculate the CID we'd need to do all the chunking and processing work on the client end, which might be costly
In order to not duplicate chunking work we'd need to use an API endpoint that uses blocks instead of files, which is a breaking change from how ipfs add works today 2) IMO this is going about things the wrong way. Instead of trying to make the client "smarter" instead data should just be imported to IPFS as locally as possible in order to take advantage of IPFS's properties (e.g. data deduplication, downloading data from many parties, download resumption, etc.)
- For example, if you're trying to import 67GiB of data to a go-ipfs instance that's on the "Server" why ipfs add --api from the "Client" instead of either:
  - Adding the data to the server and then adding it to go-ipfs
  - Running a second IPFS instance on the "Client" and after adding files on the Client running an ipfs add on the server (performance of the transfer can be boosted by having the Client do an ipfs swarm connect /multiaddress/of/server beforehand.)

Even in local deployments this could save time since IPFS unnecessarily transferring files.

I agree that there is over-coupling of the local CLI interactions with the remote CLI + HTTP API. If we could separate them out more cleanly we could take advantage of their differences. For example, when working with a local daemon we could send file names instead of byte streams to the daemon so that we don't even have to read the data on the client end. However, for this to be worthwhile we'd have to get some benchmarks showing that streaming the data over HTTP from one process in the same machine to another makes a measurable impact on ipfs add performance.

kallisti5 commented 4 years ago

For example, if you're trying to import 67GiB of data to a go-ipfs instance that's on the "Server" why ipfs add --api from the "Client" instead of either:

Adding the data to the server and then adding it to go-ipfs Running a second IPFS instance on the "Client" and after adding files on the Client running an ipfs add on the server >> (performance of the transfer can be boosted by having the Client do an ipfs swarm connect /multiaddress/of/server beforehand.)

Mostly because most folks architect around microservices these days :-) Here are a few use cases:

Home usage: I have a Synology NAS with docker, running the IPFS daemon there... exposing it via NAT forwarding, and using it as a gateway for nodes on my network (as well as my phone). To add something from any machine on my network, I add --api to the client's ipfs settings and can publish things to the IPFS network.
Server usage: Deploying IPFS to a container in kubernetes, having multiple other containers use the IPFS daemon (or cluster) as a "gateway" to ipfs services.

As for alternatives, Is there any way to generate the CID multihash outside of the ipfs daemon for individual files, check for it's existence on the IPFS node, then only adding the file when missing? I see a few multihash projects in rust

Not 100% sure how adding a folder of files would work if I can't ipfs add -r directory/* though...

Someone mentioned their rsync-to-ipfs script here which gives some clues how to generate folder CID's external to the IFPS daemon, but it's pretty complicated to just sync some files to IPFS :-|

https://github.com/RubenKelevra/rsync2ipfs-cluster/blob/master/bin/rsync2cluster.sh

#get new rootfolder CIDs
ipfs_mfs_folder_cid=$(ipfs_api files stat --hash "/$ipfs_folder") || fail 'repo folder (IPFS) CID could not be determined after update is completed' 400

echo -ne ":: publishing new root-cid to DHT..."
ipfs_api dht provide --timeout 3m "$ipfs_mfs_folder_cid" > /dev/null || warn 'Repo folder (IPFS) could not be published to dht after update\n' -n
echo "done."

kallisti5 commented 4 years ago

tldr; this would all be a lot simpler if ipfs add -r directory/* would identify CID's as "already existing" on the target IPFS node and skip uploading them. You would be able to keep massive directories in-sync to IPFS with a few lines of code vs 600+ line scripts manually managing data.

lidel commented 4 years ago

@kallisti5 I only glanced at your use case, but perhaps all you need is:

ipfs add --offline --only-hash <file> run against local repo will return CID without sending the data over the net or adding it to the local datastore
ipfs block stat --api=$IPFS_API_SERVER --offline <cid> will return error if CID is not already present in the remote datastore

kallisti5 commented 4 years ago

Oh... ipfs block stat does indeed seem pretty close to a solution. Is there a way to build a directory CID without uploading all of the files?

aka.. ipfs add -r mydir builds a directory of CID's and filenames and returns a top level directory CID. The only way i'm aware of to build it is to do an ipfs add -r ... (which will upload everything)

There seems to be json data structure backing these...

$ ipfs object get --api=/ip4/192.168.1.10/tcp/45001 QmXxje2qwGHG6FxLAgrCYqiJrMq1Ltfc1Aon9u6L2HoDdi

{"Links":[{"Name":"ipfs-sync.txt","Hash":"QmPgFizpQhTvJdYutZb5YtsQPzRciinn7qmBejupTQJPx9","Size":48},{"Name":"x86_64","Hash":"QmNbMM6TMLAgqBKzY69mJKk5VKvpcTtAtwAaLC2FV4zC3G","Size":37169349516},{"Name":"x86_gcc2","Hash":"QmPA39zYE5cQTVnCRxxNG7kkdpf25zCwwjMgcXeVqRqt13","Size":37446539334}],"Data":"\u0008\u0001"}

$ ipfs object get --api=/ip4/192.168.1.10/tcp/45001 QmNbMM6TMLAgqBKzY69mJKk5VKvpcTtAtwAaLC2FV4zC3G

{"Links":[{"Name":"current","Hash":"QmcDFXCBQJmUg2CidjpSDeAwyVddQooh2aTkQAQMaFairM","Size":37169349458}],"Data":"\u0008\u0001"}

$ ipfs object get --api=/ip4/192.168.1.10/tcp/45001 QmcDFXCBQJmUg2CidjpSDeAwyVddQooh2aTkQAQMaFairM

{"Links":[{"Name":"listing.txt","Hash":"QmXpe546K4cncziAAXYfjjVoBFQiVWwEhnbjYVRtWPLGCR","Size":3491150},{"Name":"mirrors.txt","Hash":"QmejewKCQTyjcQz5GwqRwh1NBrjVDGZn4DkxFZ9LnnM7BE","Size":96},{"Name":"package.list","Hash":"QmQrxoaCDga6x8Pg3HwJsCQk1d8gU74fU2dzFYXGgQH7HP","Size":193320},{"Name":"packages","Hash":"QmQA2gTcUsnCz7STvfoorVijp2FjnSscmaxc98W5L43FRt","Size":37164149982},{"Name":"repo","Hash":"QmRypZ6qCEjan43fSRf2NQWuEWMNfavsL8Xm1wpprjhVRd","Size":1318774},{"Name":"repo.info","Hash":"QmfP3Skjx96rN8CnsBfgHLgVHDjh7whZiv2vaPJ2P65HWx","Size":180},{"Name":"repo.sha256","Hash":"QmfDXhPujxC9jUb6bd2BbfX6F1XZv7o31fsJPYxv9ZuA33","Size":72},{"Name":"repo_consistency.txt","Hash":"QmYL6CnmBj1rcKNJ7giBL7Np8E5xb1ZHB4M3AhYKx9PaRY","Size":2198},{"Name":"report.txt","Hash":"QmWE91M7cN3pRqwLneeR7ppWLovMDaFRhSkfiBdb8gjAKJ","Size":193192}],"Data":"\u0008\u0001"}

lidel commented 3 years ago

Is there a way to build a directory CID without uploading all of the files?

Yes, you can use ipfs object patch and add-link to build any unixfs directory tree you want, all without fetching the actual content of CIDs you use while building it. See examples in:

kallisti5 commented 3 years ago

This definitely seems like a potential solution, but it's still pretty slow.

NEWFOLDER=""
find /cache -type f -not -name ".*" | while read line; do
        HASH=$(ipfs add -q --offline --only-hash $line)
        BARENAME=$(echo $line | sed -e 's/^\/cache\///')
        NEWFOLDER=$(printf "$NEWFOLDER\n$HASH:$BARENAME")
        ipfs --api=$IPFS_API_SERVER block stat --offline $HASH > /dev/null 2>&1
        if [[ $? -eq 0 ]]; then
                echo "[HAVE] $HASH $BARENAME"
        else
                echo "[NEED] $HASH $BARENAME"
                ipfs --api=$IPFS_API_SERVER add $line
        fi
done

[HAVE] QmcGhQD1XZciqT7XsZAgThFy8jCf9Prx5YKvCs4wyvm9WC x86_64/current/packages/libnsbmp_devel-0.1.6-1-x86_64.hpkg
[HAVE] Qmf7NdKJdgU8bECLX8UCdkuCtNiJ34NxF9jUfXdWDtoRkK x86_64/current/packages/libnsbmp_source-0.1.6-1-source.hpkg
[HAVE] QmQcDj37twC2UMLq6iVWvGKzm4x2kSfq896sjN8DBMf6o1 x86_64/current/packages/libnsgif-0.2.1-2-x86_64.hpkg
[HAVE] QmTcXpxxrLfyAVXALuCsf17dH4XDLe1wqg45vHNGYh73bc x86_64/current/packages/libnsgif_devel-0.2.1-2-x86_64.hpkg
[NEED] QmeXinXcUQE7pC4gT1Z1VU6d4wRPpEkqZX9sLBByoBDDGW x86_64/current/packages/libnsgif_source-0.2.1-2-source.hpkg
added QmeXinXcUQE7pC4gT1Z1VU6d4wRPpEkqZX9sLBByoBDDGW libnsgif_source-0.2.1-2-source.hpkg
 264.13 KiB / 264.13 KiB [=====================================================] 100.00%
[HAVE] QmcwQMGmtARo2q9JJGk91Px64Q1VG5rW8FjRW3omz1nZRn x86_64/current/packages/libnslog-0.1.3-1-x86_64.hpkg
[HAVE] QmYzTfwCcCQ4Fe5NZCVkS1ajDdyBAhuLy7a5nfbKGTmLVm x86_64/current/packages/libnslog_devel-0.1.3-1-x86_64.hpkg
[HAVE] QmfVVTX9HAVMhrxYcWPm4c5Ymp6X1k1tSBKY1UGTyM53ho x86_64/current/packages/libnslog_source-0.1.3-1-source.hpkg

Now you take NEWFOLDER which has: CID:relative/path/to/file

Then you patch an empty folder, appending each resulting CID on top of the last? (generating a TON of unused folder structures on the IPFS node.

I think this example helps to demonstrate how quickly a basic task of keeping a large number of moderately sized files can quickly overwhelm the CLI tools.

If ipfs add -r locally calculated and checked for the remote file before sending it over the wire, the "remote IPFS node" use case gets a lot less painful.

I get that the core demographic is desktops running IPFS, but in any kind of server or "central IPFS node" environment things get painful quickly. (I know of IPFS cluster, but it would suffer the same issues in theory unless you're calculating and uploading directly from a cluster member?)

kamaradclimber commented 3 years ago

Would also be interested in such solution for the following use case: making backup of data stored by containers. The idea is to make a frequent backup of data stored by a given container while avoiding to resend data that has not changed (we can assume that most files won't change between two backups).

Bonus point if the solution works with ipfs-cluster :)

RubenKelevra commented 3 years ago

@kamaradclimber well, this sounds more like a use case for a local ipfs node that can access the containers via the filestore option. You can create the CID and add it to the cluster, while the filestore will provide the data when the cluster requests them.

kamaradclimber commented 3 years ago

Thanks for your answer. I think I've not explained correctly the use case. I'm not trying to distribute the container image but rather to make frequent backups of some of the data an app instance (which happens to run in a container) is producing.

kallisti5 commented 3 years ago

A quick addition on "why".

IPFS seems like it would be a powerful tool to replicate a software package repository (think apt repo). It has:

Built in transparent signature checking
Ability for users to mirror packages by pinning an IPNS entry
Deduplication to reduce space utilization
Any user can "setup" a mirror via an IPFS gateway
Any gateway could be used as a mirror, accessing content pinned by all users

This use case is intense (and seemingly way more useful than the current hosting a "static website" design plans). Who needs rsync mirrors when you can just have folks pin content on IPFS?

lidel commented 3 years ago

Back to the original feature request:

"ipfs add" calculate CID and check for it before uploading to API server

go-ipfs now supports ipfs dag export|import commands, which make it possible to implement requested behavior in userland:

Run ipfs add on machine (A), it will produce CID. If you don't want to duplicate data, enable filestore and do ipfs add --nocopy
Run something like ipfs files stat /ipfs/{cid} --with-local on machine (B) to see if it is 100% present on (B)
If not, ipfs dag export on (A), copy CAR archive to (B) and do ipfs dag import there.

You can do this today with go-ipfs 0.9.0

We could introduce ipfs add --create-car which creates a CAR archive first, and then uploads it to API via /api/v0/dag/import (+ leaves CAR artifact as side-effect), but this seems like a lot of work to account for a niche use case, which could be solved in way more flexible way in userland.

Perhaps a better home for this feature is a tool/script built on top of ipfs command?

liqsliu commented 3 years ago

my solution: check curl -m 2 --max-filesize 1 -s -o /dev/null -w '%{http_code}' https://bafkreibuuyw5r6ji4u4ysi3w6f45mfbvyravqb77vtp5kootzncpnfqpbm.ipfs.infura-ipfs.io/?filename=file_781.webp

ipfs / kubo

"ipfs add" calculate CID and check for it before uploading to API server. #7586

ipfs / kubo

"ipfs add" calculate CID and check for it *before* uploading to API server. #7586

"ipfs add" calculate CID and check for it before uploading to API server. #7586