smwa / wistfulbooks

A static single page app to allow easy use of books from librivox.org
http://wistfulbooks.com/
MIT License

Bring wistfulbooks back to ipfs #7

Open RubenKelevra opened 3 years ago

RubenKelevra commented 3 years ago

Hey @smwa

I'm not sure where your issues with IPFS come from - but I'm sure we can work them out. :)

I've got an idea for how to "just" provide the bandwidth, without having to store the whole archive of files on your IPFS node:

LibriVox can be found on archive.org, which has a static URL for every file. So we could set up an IPFS node and add all URLs via the urlstore:

ipfs add --nocopy https://archive.org/download/girl_boat_1109_librivox/girlboat_01_wodehouse.mp3

This command will fetch the file from the URL, calculate the CID, store the URL in your database, but won't store the actual file there.

Your node will (try to) fetch the file from the URL every time someone on the IPFS network requests it.

This way we don't have to keep a copy of the actual data, only the metadata, but can still guarantee that the file is intact when it's sent to the IPFS network.

We can also generate a list of all URLs and CIDs and put them in a cluster in the long run.

This will have the other cluster members download the whole files and store them, while your server will just provide the download source.
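
A minimal sketch of how the per-URL step could be scripted, in Python. It reuses the `ipfs add --nocopy <url>` invocation shown above; the `--quieter` flag (print only the final CID) and the helper names `build_add_command`, `run_ipfs` and `add_urls` are additions for illustration. It also assumes the node has the experimental urlstore feature enabled.

```python
import subprocess
from typing import Callable, Dict, Iterable, List


def build_add_command(url: str) -> List[str]:
    """Return the argv for adding one URL without copying its data."""
    # Same command as in the comment above, plus --quieter so that only
    # the resulting CID is printed (an addition for easier parsing).
    return ["ipfs", "add", "--nocopy", "--quieter", url]


def run_ipfs(argv: List[str]) -> str:
    """Run one ipfs command and return its trimmed standard output."""
    result = subprocess.run(argv, check=True, capture_output=True, text=True)
    return result.stdout.strip()


def add_urls(
    urls: Iterable[str],
    runner: Callable[[List[str]], str] = run_ipfs,
) -> Dict[str, str]:
    """Add every URL and return a URL -> CID mapping."""
    return {url: runner(build_add_command(url)) for url in urls}
```

The returned mapping is the kind of URL/CID list that could later be handed to a cluster. Passing a different `runner` allows a dry run without a local IPFS daemon.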

RubenKelevra commented 3 years ago

Just for fun, I put this file (directly) on one of my IPFS nodes:

/ipfs/QmVWbggufFDb7fTMSwgpV9iXRv7HrtLufz2SVBXj5rEUku

IPFS seems to work fine for me:

$ time wget https://ipfs.io/ipfs/QmVWbggufFDb7fTMSwgpV9iXRv7HrtLufz2SVBXj5rEUku
--2021-03-12 13:38:15--  https://ipfs.io/ipfs/QmVWbggufFDb7fTMSwgpV9iXRv7HrtLufz2SVBXj5rEUku
Loaded CA certificate '/etc/ssl/certs/ca-certificates.crt'
Resolving ipfs.io (ipfs.io)... 2602:fea2:2::1, 209.94.90.1
Connecting to ipfs.io (ipfs.io)|2602:fea2:2::1|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 13533082 (13M) [audio/ogg]
Saving to: ‘QmVWbggufFDb7fTMSwgpV9iXRv7HrtLufz2SVBXj5rEUku’

QmVWbggufFDb7fTMSwgpV9iXRv7HrtLufz2SVBXj5rEUku 100%[==================================================================================================>]  12,91M  3,87MB/s    in 3,3s    

2021-03-12 13:38:21 (3,87 MB/s) - ‘QmVWbggufFDb7fTMSwgpV9iXRv7HrtLufz2SVBXj5rEUku’ saved [13533082/13533082]

real    0m5,893s
user    0m0,061s
sys     0m0,073s

RubenKelevra commented 3 years ago

Also: You could actually pin each file individually on the cluster. This would allow reducing the amount of storage required on each cluster member if you set it to something like 2:3 copies. This would create 3 copies on cluster nodes, and if there are fewer than 2 copies of a file remaining, the cluster would ask 2 additional cluster nodes to make copies again.

So if there are e.g. 8 cluster members, each would only need to hold about 37.5% (3/8) of the total size of all files. :)

This way you don't need the probabilistic pinning, since you don't need to split up large CIDs into multiple smaller ones, and you already have a trusted source which holds all the data reliably.
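
The storage estimate above can be checked with a short calculation. This is only a sketch under an idealised assumption: every file is stored on exactly 3 members and the cluster spreads allocations evenly across all 8. The function name `average_share_per_member` is made up for illustration.

```python
from fractions import Fraction


def average_share_per_member(copies: int, members: int) -> Fraction:
    """Fraction of the total data each member holds on average.

    Assumes every file is stored on exactly `copies` members and that
    allocations are spread evenly (an idealisation).
    """
    if members <= 0 or not 0 < copies <= members:
        raise ValueError("need 0 < copies <= members")
    return Fraction(copies, members)


share = average_share_per_member(copies=3, members=8)
print(f"{float(share):.1%}")  # 37.5%
```

In practice the share per member will vary, since the cluster allocates whole files rather than equal byte counts.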

RubenKelevra commented 3 years ago

If you don't want to run the server for IPFS, that's fine. I can run this metadata-style fetching on 3 servers if you like.

I just don't have a script which gets me all necessary URLs. :)

smwa commented 3 years ago

@RubenKelevra I want to dig into this and do some research when I get time, but I do want to say that I didn't know you could add URLs with no-copy. For low-access files like the ones in this project, that would make it feasible for me and whoever wants to help to run this in the beginning, and it will let it scale if there's ever a large influx of users! I'm very excited about this, and I'm thinking that perhaps I shouldn't give up on IPFS yet. Thank you!

RubenKelevra commented 3 years ago

You're welcome!

A current limitation of IPFS is that publishing a lot of content takes time.

Each block needs to be announced (as a provider record) to the corresponding DHT nodes.

There are some options to decrease this delay. The easiest one is just to publish the root-CID of each file manually, to push it ahead of the queue:

ipfs dht provide <CID>

If you add a heck of a lot of content, it might even make sense to change the reprovide settings. But since this has some serious drawbacks, I wouldn't recommend it in this case. IMHO it only makes sense on extremely bandwidth-limited nodes which have to hold terabytes of data.

For everything else the default settings should be fine.

smwa commented 3 years ago

I'm less concerned about the time it takes to push that data out, and more about generating the CIDs for all of the files. Even with no-copy, every server will download every file to calculate the CIDs, right? If I did this once, could I take the URL and calculated CID to a new server and tell the IPFS daemon to add the URL (no-copy) without downloading the file?

smwa commented 3 years ago

I've realized that I can cut out the middleman and let the browser connect straight to archive.org to download whatever MP3s are necessary. This might be a problem if the files disappear from there, but that's not characteristic of archive.org. So for now, I'll leave wistfulbooks off of IPFS.

RubenKelevra commented 2 years ago

Hey @smwa,

you can now access the whole archive.org catalog via IPFS:

So for example:

https://archive.org/details/art_of_war_librivox

That page lists an Identifier-cid, which can then be fetched via IPFS (in theory).

Currently, they seem to have some trouble with that (or it's a local issue here) but in the long run you can simply fetch the CIDs :)

The CAR file is also available as a download on the right side. :)

smwa commented 2 years ago

Oh, this could be useful! I'll research the reliability and see if it's stable on my end. If archive.org is willing to keep those available in IPFS then I'd prefer to switch back.

RubenKelevra commented 2 years ago

Yay!

Yes, as far as I understand, they are currently working on moving their infrastructure to IPFS. But I may be wrong here. :)

This is how you import a CAR file for testing (followed by a listing of some example content):

$ ipfs dag import --pin-roots --stats Downloads/*.car
Pinned root     bafybeig26ogxmqudmu5cbyxrk5p3aagjv66xidrexwj27ejwrod6rg7x5a     success
Imported 852 blocks (199486088 bytes)
$ ipfs ls bafybeig26ogxmqudmu5cbyxrk5p3aagjv66xidrexwj27ejwrod6rg7x5a
bafybeig2qwchojdamlpk7gji74m5o2ebfqsrjl7gwvixhjecvv676vqroi 18084624 ArtOfWar-64kb_librivox.m4b
bafkreif7qxazz4vc2se5oil22uusvtktr7ovystsakt4zitjj6f6hyyma4 86335    Art_War_1107.jpg
bafybeifrtqvcdh3sjssvebmwnt3bka5dwoecfaolwcq46nx34vjuxjzlke 377858   Art_War_1107.pdf
bafkreif7t3gbyc6ua3r2s4dutzpw5de6zv4sgwrru2cq5oyoeup3gpqkhe 16664    Art_War_1107_abbyy.gz
bafkreihjw5kjh6jjiddhlohkmyc2qets7nxm55ocxb54spwgx7hizpl2hu 14637    Art_War_1107_chocr.html.gz
bafkreige3b7kozgt52kphvrx4v7wymnd24wgz5sfmrfgg4axwadpiqssky 1300     Art_War_1107_djvu.txt
bafkreiap6evtjkbpszpzvaaxp6gql67yfrsk3iejyfreoh5awfbeee6k6e 15511    Art_War_1107_djvu.xml
bafkreicsgqr5uepfka2u4hi4o2n35xtsoprzadxmda4wtunnc5kbxqam4q 31174    Art_War_1107_hocr.html
bafkreiawnbs5w44vfnrqkhre6uza6u7krcis7bnt52k3zmgpgkkwcwf524 41       Art_War_1107_hocr_pageindex.json.gz
bafkreiggq7yfpngbloswowl74yq4y67lpfgvxicdkfawhhp3rqditiqwyq 701      Art_War_1107_hocr_searchtext.txt.gz
bafybeic2icnuemcytz3flmvulff73jstpzvkw2zdolxtkkayk4mk72t3va 648113   Art_War_1107_jp2.zip
bafkreidpeauanedqxojxaymyyqspheu3tkvxdxbogs7ymgzldiwuypfxha 120      Art_War_1107_page_numbers.json
bafkreigznkmp5uewkfu5cc5ocpzru3fuutyr55ut7fijlidecy7dplnfri 529      Art_War_1107_scandata.xml
bafkreiajzeyqxnmev4qmmq3vptzubrcg54b2np5555gsqj6a4k6vvltema 5206     Art_War_1107_thumb.jpg
bafkreieekinkm3wzm4paeeln3yd2x7nss6gqzxd36zmpdp7xt4odi7pvia 12928    __ia_thumb.jpg
bafybeihh5rmxiohvdfhh5zdoj3t5bmcbrwoymqmpb4tyidoi2ldatenmiq 8113642  art_of_war_01-02_sun_tzu.mp3
bafybeiefu5tszfiptknjzae7hirvffgsbvoyz5cttsv2vlop26ouou7sbi 4493669  art_of_war_01-02_sun_tzu.ogg
bafkreiestxhgjqzvq5nwovbfxq72ioc3mta2xf2qgdo634mtyagidnraiq 9937     art_of_war_01-02_sun_tzu.png
bafybeieev6sq7intsyawvxv6ynr52xzjkv6hthpxe33ylj5frhqdsa37d4 4055167  art_of_war_01-02_sun_tzu_64kb.mp3
bafkreibmpa3nlcajbig4okq23gz7tkakwv7mwcpr3u4fcj6bopldqwvt6a 1969     art_of_war_01-02_sun_tzu_esshigh.json.gz
bafkreihtmvd64k6ron4pwhukbxl3eqwu7eakius7ll2niii3a7o2udmkvy 27384    art_of_war_01-02_sun_tzu_esslow.json.gz
bafybeida4dtva4msyz2qn7xpemk6kyywtr5m2yijlr43z7xd2nsnenetdy 297752   art_of_war_01-02_sun_tzu_spectrogram.png
bafybeicw45f4tuv2roegygmmxdn347h3zoo6mr3mrcjmoiu4feygensdvq 7450758  art_of_war_03-04_sun_tzu.mp3
bafybeib5adwzk6mykixjn2grwga2gfwvqjl6xv3frhejzbx52bhz7tcaxq 4180030  art_of_war_03-04_sun_tzu.ogg
bafkreigjth6ga4xzvao4u23nc6wx2uspdecmnqerimukmr4d56isi7lgky 9699     art_of_war_03-04_sun_tzu.png
bafybeifwr366mcs6pqpwysfatcu75mqlulkng5egjr3p3vupbd7jr3xhqu 3723859  art_of_war_03-04_sun_tzu_64kb.mp3
bafkreier4rp4z4plh2n2fj5wzzxmm2apohfmgl3c4ag4ebqy62m5esnnfa 1973     art_of_war_03-04_sun_tzu_esshigh.json.gz
bafkreigrwej5rcyv36xssxtsnrdnxwf3tw4osn5rpd5wgsosdi6gudah5e 26746    art_of_war_03-04_sun_tzu_esslow.json.gz
bafybeibwhqvumvhaare4eebzv3l27wcon7vdrez26vliwnbp74wrvyrrsu 299252   art_of_war_03-04_sun_tzu_spectrogram.png
bafybeibubluvxxbus7w2pdfnjxnxpvurku2aryc37zc6swtiydb3n6ckem 10191735 art_of_war_05-06_sun_tzu.mp3
bafybeihddfk7kk6brciwyziwzxitklnad3fifp7kbyyirz4njx2d26qvru 5699415  art_of_war_05-06_sun_tzu.ogg
bafkreialpuvia4b47u3umeww6sql6ylmddrzx7z6pa2lzcsizvcffc4msy 9868     art_of_war_05-06_sun_tzu.png
bafybeifg4hp3so3s6cc6f7mfqv7nqtouuvwjdpdy5wh52s6mwaakbwaepu 5094335  art_of_war_05-06_sun_tzu_64kb.mp3
bafkreib5z2gxsidikfzki6pj4irdsyzf2bbvlhevuc3ef26orasj7bcd3u 1968     art_of_war_05-06_sun_tzu_esshigh.json.gz
bafkreidr2tirk7i54rnt5pshikxygq7nc4mxegf254h5krl4ct5fxtcfse 29158    art_of_war_05-06_sun_tzu_esslow.json.gz
bafybeicskkv3ndpjw55af6d6vn6hw2hkoehmr6o3ee4vmjxoq53srnr3zi 295143   art_of_war_05-06_sun_tzu_spectrogram.png
bafybeihghxlijdcy5xka47dz2nhax2kfcka55mxtqm5sj3mzqckrpu743a 8496074  art_of_war_07-08_sun_tzu.mp3
bafybeicczkds5nkrt4f6cabqvhn2cciwcvqcdn4lpo6yzphfelunw3akxi 4798509  art_of_war_07-08_sun_tzu.ogg
bafkreibourutqpxlphls5o6rn5ayhkh4ajlyqcgmyuebvxf545gumcawca 10235    art_of_war_07-08_sun_tzu.png
bafybeien5andj5hkpmaooc2a2t6jehfanz6ftt756zlsv3pye5xckpbqya 4246508  art_of_war_07-08_sun_tzu_64kb.mp3
bafkreiegvzibiplcyuxqkgvlrgbxbfqawjavbl4i2ku2mdrrfrejdcxxbq 1958     art_of_war_07-08_sun_tzu_esshigh.json.gz
bafkreid6i3w2uq5cmtzwzwsnegvjm5fistczw2zar3zle4vrvqfhloy2yu 27637    art_of_war_07-08_sun_tzu_esslow.json.gz
bafybeia3zvyczkei43guslt37tjjyrnsqibyeq2zcldaipxt657ltjn3jq 298736   art_of_war_07-08_sun_tzu_spectrogram.png
bafybeiesadyftvi7ikidihme7etyedsbyub4dcx3gvamdqvgesflmokd5i 14122223 art_of_war_09-10_sun_tzu.mp3
bafybeidha5do6rxbv2iz6ky6hg5qiv4xqsrwchbrpczbe3m4t6sbtqqhq4 7910454  art_of_war_09-10_sun_tzu.ogg
bafkreifnk3hr7ioictzd4euz35afhjmtc3eavmmdx3p3agkwxe3x7n5rtq 10051    art_of_war_09-10_sun_tzu.png
bafybeidy5ovfzl5pobf5crav2nlubzd553prnbricsuzqrcipdnvurvasy 7059580  art_of_war_09-10_sun_tzu_64kb.mp3
bafkreihc4jx2fhembf6yokkbtah7mpoat5zdsijpiwx5y4u4rqhrtif5ym 1963     art_of_war_09-10_sun_tzu_esshigh.json.gz
bafkreib55skxvxjedru6pqq4i7cdnfha4ibkvcbeaewdg5xdko5yle73f4 32076    art_of_war_09-10_sun_tzu_esslow.json.gz
bafybeigmag7uueiblaicordyi5qgzamvmgohvk36lmlqguv6po3qsf7jjm 292572   art_of_war_09-10_sun_tzu_spectrogram.png
bafybeidl43fj75mochlcs2mkr4piuxvgljlwdjfpn3z2ahb7wo7omkk6ke 12425429 art_of_war_11_sun_tzu.mp3
bafybeihwextzioap4nucdmv3zi3nx5a3f6mxqpj5wx4pmixy7oxvlmfkk4 6959231  art_of_war_11_sun_tzu.ogg
bafkreia3xq5avelboodkkrfwaimr2axlxnam46kld4ghaboxawujsxbjdy 9799     art_of_war_11_sun_tzu.png
bafybeigmuvbq5wggxucwguh3njegsgsehtg6cgqdlugrjie6tk52hflq7e 6213090  art_of_war_11_sun_tzu_64kb.mp3
bafkreifyfsari2pibb3phw62hpjgiu7sttd2fl3cugyydbpcu2jpc5ems4 1938     art_of_war_11_sun_tzu_esshigh.json.gz
bafkreihwgaj2hqorliph255knk5kjwrkc7vum32m7rvbrv5sahk3gzeskq 31126    art_of_war_11_sun_tzu_esslow.json.gz
bafybeigtnm56td2fkvvyhwq27fgta4kgvdm33n5xa5qwoainvs5qbi2ep4 294588   art_of_war_11_sun_tzu_spectrogram.png
bafybeihmo3slxmtpzx4ywiwnri27aukhxo7szewcurhf455zbymdncbiba 8624947  art_of_war_12-13_sun_tzu.mp3
bafybeibldcs47prcjh5mhqduj35b673knlqwoxk3b46q3sitmjsewxstfe 4808500  art_of_war_12-13_sun_tzu.ogg
bafkreibcwgprnhochpjkvnep3yb5rhj4q4jvsvflq4lgtqfon34qwkb7z4 10212    art_of_war_12-13_sun_tzu.png
bafybeie7xvcdsw2nobr6gccdfjo4z3eaoresp5fwvi7uck5yyimn3puptm 4312968  art_of_war_12-13_sun_tzu_64kb.mp3
bafkreiaseddduovhxwvm7cjwov76j4wzj2c5vikyoko4la3lrd5opqjxum 1957     art_of_war_12-13_sun_tzu_esshigh.json.gz
bafkreicot3nibnnxvo456q74elmaopreunmbjsztykwvvge4qc6yu3hje4 27675    art_of_war_12-13_sun_tzu_esslow.json.gz
bafybeidsj3sum4vsirc5cf27pym5o2vmvelmrhyzdwtjz3arijxtxm3vrq 299162   art_of_war_12-13_sun_tzu_spectrogram.png
bafkreihdwdcefgh4dqkjv67uzcmw7ojee6xedzdetojuzjevtenxquvyku 0        art_of_war_librivox.car
bafkreid7lhneqcri5gco4nlyfootmzhgtezbrusdpxikbucprcllkpeime 7901     art_of_war_librivox.json
bafkreihdwdcefgh4dqkjv67uzcmw7ojee6xedzdetojuzjevtenxquvyku 0        art_of_war_librivox.pack-car.trigger
bafkreihdwdcefgh4dqkjv67uzcmw7ojee6xedzdetojuzjevtenxquvyku 0        art_of_war_librivox.storj-store.trigger
bafkreia6sffh7gm5taarpxlnifv3ely3vfhijufvrfohkntztcc67vk7jq 564      art_of_war_librivox_128kb.m3u
bafkreifv6dbpkhyypuftthhx6vpgxih4hi2xrbfhehnll3bvvuaglpqszm 599      art_of_war_librivox_64kb.m3u
bafybeibmalqq5cnn5tpdtrl7uhehyalj37a2p6hetazyj2zdebt3ezdliy 34706755 art_of_war_librivox_64kb_mp3.zip
bafkreicvl5iymkl27mtx6fzx36ougtf5hwufmnmsqvu72hlj2mfyxbragi 28726    art_of_war_librivox_archive.torrent
bafkreihuwiuh7qcbf4bu73l5p4d3cb2g64xy7as75hidjozzikbmbtein4 31571    art_of_war_librivox_files.xml
bafkreigixh6glwbk7ujsgy5apbdnvq54w55buecw6xburolhsyp4xuxhkm 11264    art_of_war_librivox_meta.sqlite
bafkreidvfo4hwz2jx2khsm65s2debms42ohu7pwbbchae3lmw3mn4h3d7i 3627     art_of_war_librivox_meta.xml
bafkreih4ujy7cofop7kn7gjck4trakcan7oo3e2hdqs2i2uksdovaaxhx4 11759    art_of_war_librivox_reviews.xml
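
To pick the audio files out of a listing like the one above, the `ipfs ls` output could be parsed along these lines. This is a rough Python sketch that assumes the three-column layout shown above (CID, size, name). The names `Entry`, `parse_ipfs_ls` and `full_quality_mp3s` are made up for illustration, and the `_64kb.mp3` suffix check simply follows the file names in this one example.

```python
from typing import List, NamedTuple


class Entry(NamedTuple):
    cid: str
    size: int
    name: str


def parse_ipfs_ls(output: str) -> List[Entry]:
    """Parse `ipfs ls` output lines of the form '<cid> <size> <name>'."""
    entries = []
    for line in output.splitlines():
        # Split into CID, size, and the rest of the line as the name.
        parts = line.split(None, 2)
        # Skip blank lines and rows whose size column is not a number.
        if len(parts) != 3 or not parts[1].isdigit():
            continue
        cid, size, name = parts
        entries.append(Entry(cid, int(size), name.strip()))
    return entries


def full_quality_mp3s(entries: List[Entry]) -> List[Entry]:
    """Keep .mp3 files, excluding the lower-bitrate *_64kb.mp3 variants."""
    return [
        e for e in entries
        if e.name.endswith(".mp3") and not e.name.endswith("_64kb.mp3")
    ]
```

Each selected entry carries its own CID, so individual chapters could be fetched or pinned without downloading the whole item.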

RubenKelevra commented 2 years ago

@smwa looks like that's more the exception right now, sadly. So I think most books are currently not shared via IPFS.

Can you share or generate a list of all the URLs of the content you need? I could just fetch it all with my server and store it via urlstore (so it's fetched on demand when accessed via IPFS).

That shouldn't be an issue if the URLs to archive.org are stable, which they appear to be.

If you like, I could also create a collaborative IPFS Cluster instance for it, so other people can "really" store the files on their servers/computers.

smwa commented 2 years ago

Ah darn. Hopefully that's one of their projects, since they seem to be committed to a resilient web. I can get you a list (available at https://raw.githubusercontent.com/smwa/wistfulbooks/master/catalog/books/*.json:sections[*].path).

That's a lot of processing for you, and it would need additional work each time books are added. I am not ready to switch over to urlstore since there are no decentralization benefits to it.
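
For reference, the list described above (each book file's `sections[*].path`) could be collected roughly like this in Python. The sketch assumes only what that notation implies: each book JSON has a `sections` array whose entries carry a `path` field. Whether `path` is a full URL or a relative path is not stated here, and the function names are made up for illustration.

```python
import json
from typing import Any, Dict, Iterable, List


def section_paths(book: Dict[str, Any]) -> List[str]:
    """Return the `path` of every section in one parsed book file."""
    # Sections without a `path` key are skipped rather than treated as errors.
    return [s["path"] for s in book.get("sections", []) if "path" in s]


def all_paths(book_json_texts: Iterable[str]) -> List[str]:
    """Collect section paths from the raw JSON text of several book files."""
    paths: List[str] = []
    for text in book_json_texts:
        paths.extend(section_paths(json.loads(text)))
    return paths
```

Reading the files from a local checkout of the repository and passing their contents to `all_paths` would yield one flat list, which could be regenerated whenever books are added.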