harvard-lil / perma

Indelible links

Upload to Webarchive still doesn't work after months #3255

Closed browserwatchtg closed 1 year ago

browserwatchtg commented 1 year ago

[image]

I already reported this problem to you months ago, but nothing has been done so far. Please fix uploading to the Web Archive once and for all.

browserwatchtg commented 1 year ago

still doesn't work

rebeccacremona commented 1 year ago

You can track ongoing progress on this long-term project by following the many PRs referencing the Internet Archive or the associated project board.

browserwatchtg commented 1 year ago

And what does this mean now? Do you plan to re-add all the previous backups to the Wayback Machine or not? Currently almost everything from 10.22 onward is missing, and obviously so is everything new that we archive now (that is, in the present).

Also, that page seems to be very technical, about what you are doing in the background. @rebeccacremona

rebeccacremona commented 1 year ago

Yes. We stopped in October of 2022 intentionally, in dialog with Internet Archive, and will resume when our new code is ready, in dialog with Internet Archive. There will be no gaps in coverage. This is not a bug: this is a planned migration.

You may follow the technical progress of the migration as discussed; Perma.cc's commitment to contributing copies of its archives to IA has not changed, regardless of the timeline during which it is achieved.

browserwatchtg commented 1 year ago

@rebeccacremona [image] Are you now just uploading a list of all perma.cc links each day? Does that mean no more full, searchable articles?

rebeccacremona commented 1 year ago

@browserwatchtg We are still uploading each Perma Link's complete WARC and metadata. The Internet Archive exposes file-level metadata in each Item's files.xml. For example, here is the list of all available files for 2017-11-13, which includes a number of metadata-only files at the bottom of the list:

filename                                   timestamp          size
daily_perma_cc_2017-11-13.cdx.gz           24-Jan-2023 00:06  11.2M
daily_perma_cc_2017-11-13.cdx.idx          24-Jan-2023 00:06  14.3K
daily_perma_cc_2017-11-13_archive.torrent  24-Jan-2023 00:07  526.5K
daily_perma_cc_2017-11-13_files.xml        24-Jan-2023 00:07  1.2M
daily_perma_cc_2017-11-13_meta.sqlite      20-Dec-2021 14:12  2.0M
daily_perma_cc_2017-11-13_meta.xml         24-Jan-2023 00:06  770.0B

files.xml contains complete metadata about each link, including the submitted URL, page title, page description, and creation timestamp.
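
As a rough illustration of reading such a file, here is a minimal Python sketch that parses a files.xml document and collects whatever attributes and child elements each file entry carries. The sample XML is a hypothetical, heavily trimmed stand-in, not copied from a real Item, and the function name `list_files` is made up for this example; the exact per-link field names are not stated in this thread, so the code does not assume any.

```python
import xml.etree.ElementTree as ET

# Hypothetical, trimmed sample in the general shape of an Internet Archive
# *_files.xml document. Real files carry more entries and more fields.
SAMPLE_FILES_XML = """<files>
  <file name="daily_perma_cc_2017-11-13.cdx.gz" source="derivative">
    <size>11744051</size>
  </file>
  <file name="daily_perma_cc_2017-11-13_meta.xml" source="original">
    <size>770</size>
  </file>
</files>"""


def list_files(files_xml: str) -> list[dict]:
    """Return one dict per <file> element: its attributes plus its child elements."""
    root = ET.fromstring(files_xml)
    entries = []
    for file_el in root.iter("file"):
        entry = dict(file_el.attrib)          # attributes such as name, source
        for child in file_el:                 # child elements such as size
            entry[child.tag] = (child.text or "").strip()
        entries.append(entry)
    return entries


if __name__ == "__main__":
    for entry in list_files(SAMPLE_FILES_XML):
        print(entry["name"], entry.get("size", "?"))
```

Because the parser copies every child element it finds, any additional metadata present in a real files.xml would show up as extra keys without code changes.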

During this long-term migration, up-to-date derived files, like cdx.gz, might not always be available, but they will be created at a later date.

browserwatchtg commented 1 year ago

@rebeccacremona In my opinion this is still pretty useless, since it is just storing terabytes of data with no goal.

Let's say there are 1000 files/archives each day, each more than 1 MB; this would mean 1-10 GB each day and up to 5 TB each year. Apart from the storage usage, this also means 365k archives spread across 365 different URLs, with no way to do a reverse search and find one. In reality it is much more, because in 2022 there were 500K+.
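
As a rough check of this estimate, the arithmetic can be sketched in Python. The inputs (1000 archives per day, 1 to 10 MB each) are the hypothetical figures from the comment, not measured values, and the function name is made up for the example.

```python
def yearly_storage_tb(archives_per_day: int, mb_per_archive: float, days: int = 365) -> float:
    """Estimated storage per year in terabytes (decimal units: 1 TB = 1,000,000 MB)."""
    return archives_per_day * mb_per_archive * days / 1_000_000


archives_per_year = 1000 * 365          # 365,000 archives, one daily item per day
low = yearly_storage_tb(1000, 1)        # assuming 1 MB per archive
high = yearly_storage_tb(1000, 10)      # assuming 10 MB per archive

print(f"{archives_per_year} archives per year")
print(f"{low:.3f} TB to {high:.2f} TB per year")
```

Under these assumptions the range is about 0.365 to 3.65 TB per year, so "up to 5 TB" is a generous upper bound for 1000 archives per day; a year with 500K+ archives would scale the figures up accordingly.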

OK, you store all the data and information, but it cannot be searched or used, so it is like doing nothing except keeping terabytes of data on the Wayback Machine for no purpose. Perma.cc links are not indexed by Google either, whereas through the Wayback Machine they were partly indexed. So this data cannot be searched on the Wayback Machine, on Google, on your website, or on other sites. That means only the person who archived a page knows how to find the backup again if Perma.cc ever gets shut down (that is not the goal, but you probably understand what I mean). Finding it is also time-consuming: first you need to know the exact date on which you archived the page (most people probably don't keep a record of that, as I do, so they will never know), then you need to look up that date on the Wayback Machine, and then find the correct perma.cc link.

So what I want to say is: 1) maybe you should let Google index perma.cc archives (and if the problem is that too many perma.cc results would then hide your main website, move the archives to pm.cc/42637 or whatever); 2) implement a better search feature; 3) if it is not possible to upload 1 million archives to the Wayback Machine every year (I think this is the reason you are making these changes), at least option 2 should be available (or create a yearly or 10-year summary file listing all archives).

Obviously the old mapping of a single perma.cc link to a single Wayback link was the best, but it is probably not possible anymore...
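
The date-based lookup described above could be sketched as follows. This is only an illustration under two assumptions that the thread does not confirm: that each day's Item identifier is `daily_perma_cc_<YYYY-MM-DD>` (inferred from the file names in the listing earlier in this thread), and that the Item is reachable through the Internet Archive's usual `details` and `download` URL paths. The function name `daily_item_urls` is hypothetical.

```python
from datetime import date


def daily_item_urls(day: date) -> dict:
    """Build the assumed Internet Archive URLs for the daily Item of a given date."""
    # Assumption: identifier pattern inferred from file names such as
    # daily_perma_cc_2017-11-13_files.xml; not confirmed by the maintainers.
    identifier = f"daily_perma_cc_{day.isoformat()}"
    return {
        "item": f"https://archive.org/details/{identifier}",
        "files_xml": f"https://archive.org/download/{identifier}/{identifier}_files.xml",
    }


if __name__ == "__main__":
    for label, url in daily_item_urls(date(2017, 11, 13)).items():
        print(label, url)
```

This still requires knowing the creation date of the link in advance, which is exactly the limitation raised in the comment.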

mdellabitta commented 1 year ago

@browserwatchtg Thank you for your input.