ipfs-inactive / archives

[ARCHIVED] Repo to coordinate archival efforts with IPFS
https://awesome.ipfs.io/datasets

arXMLiv / CorTeX #31

Open davidar opened 8 years ago

davidar commented 8 years ago

@kohlhase @dginev I'd really like to get arXMLiv working with our arXiv corpus (ipfs/archives#2), as it would really help with ipfs/apps#1 (including ipfs/apps#5). Would you be able to help with this?

Cc: @jbenet @brucemiller

kohlhase commented 8 years ago

Note that arXMLiv has been stale for quite a while. @dginev is working on (and making good progress on) the successor system, CorTeX (see https://github.com/dginev/CorTeX/). So we are not supporting arXMLiv any more.

kohlhase commented 8 years ago

I would like to find out more about what you are doing though.

jbenet commented 8 years ago

@kohlhase backing up all critical human knowledge, and making it easy for others to help us make replicas. lots to do, and @davidar can fill in more, but worth getting a sense of what IPFS is -- maybe see https://ipfs.io

davidar commented 8 years ago

Hi @kohlhase! This article by @kyledrake gives a good overview of the goals of the IPFS project. Here at the archives subproject, we're particularly interested in the storage and distribution of Open Access publications --- such as (the Creative Commons subset of) arXiv --- and Open Data repositories.

The reasons for this are twofold:

  1. to prevent single-points-of-failure, by allowing anyone to mirror resources like arXiv
  2. to make it easier for people to remix and do cool stuff with Open Content

Number 1 is possible right now, as all Creative Commons articles from arXiv (current as of a month or two ago) have been added to IPFS, allowing anyone running an IPFS node to contribute storage and bandwidth towards distributing these publications.

For the second point, I'm currently working on improving the integration of scientific publications and the web (beyond PDFs). What I'd like is for the arXiv publications to be able to be accessed --- and more importantly linked to --- like any other webpage, including section/paragraph-level anchors, reducing the friction of following citations, public annotations/comments, improved readability on mobile devices, etc (see https://davidar.io/TeX.js/ and https://github.com/davidar/TeX.js). This is where a project like arXMLiv/CorTeX could really help.

dginev commented 8 years ago

Hi guys! I had read about the idea to move to hashes instead of URLs in "When Google met WikiLeaks", but I hadn't realized there is a serious ongoing effort to create that world. Very cool!

The first bit of information to bring to this conversation is that while arXiv is OpenAccess for readers, it actually has a restrictive redistribution clause. But now I read your #2 issue and I see you have limited the mirror to the CC-licensed papers, nice. Ok.

As @kohlhase described, arXMLiv has been stale for some time, and I am currently in a (rather gradual) process of regenerating it with a new framework and the latest arXiv data snapshot. I think once the data is back online at the new site and I hook up the metadata information, it would be interesting to look at a sensible way to connect to IPFS. However, note that the CorTeX framework is as much about a web version of arXiv, as it is about helping improve the toolchain for creating those HTML files, currently the LaTeXML converter. And if you're accessing files on the basis of their hash, there is an interesting change management question to be answered, as each rerun would almost certainly create a resource with a new hash (at the very least we have timestamps, but there are many other bits that improve and change over time). We plan to rerun often.

As to your TeX.js project, adding a bit more information about the purpose and scope of the technical side of the project would be helpful to understand exactly what you are building there. I am assuming it currently fine-tunes the appearance of TeX-written converted documents in HTML? I have seen you have LaTeXML.css in the repository, which is a great call, Bruce Miller has done a fantastic job with mapping out the styling hooks of LaTeX-based macros to the final HTML classes.

In any case, I am curious to have this conversation, let's try to see if there is a common sense way forward. Could we think of IPFS as a cloud backup/mirror service for our arXMLiv data?

davidar commented 8 years ago

> I think once the data is back online at the new site and I hook up the metadata information, it would be interesting to look at a sensible way to connect to IPFS.

Great, looking forward to it :)

> And if you're accessing files on the basis of their hash, there is an interesting change management question to be answered, as each rerun would almost certainly create a resource with a new hash (at the very least we have timestamps, but there are many other bits that improve and change over time). We plan to rerun often.

That's fine, IPFS can handle mutation through IPNS (which is roughly equivalent to branch pointers in git). However, I would recommend that things like timestamps only be updated when the content actually changes --- or better yet, having a single timestamp in the root directory rather than embedded in each file (I assume that's what you meant, IPFS already ignores filesystem timestamps for this reason) --- as rapid changes will cause difficulties in keeping mirrors up to date. To put it another way, IPFS works best with deterministic build processes.
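
A minimal sketch of why this matters, assuming plain SHA-256 digests as a stand-in for IPFS content hashes (real IPFS identifiers are computed differently) and hypothetical file names `paper.html` and `manifest.json`. It compares two reruns of an otherwise unchanged conversion, once with the timestamp embedded in every file and once with a single timestamp kept in a root manifest:

```python
import hashlib
import json


def digest(data: bytes) -> str:
    """SHA-256 hex digest, used here only as a stand-in for a content hash."""
    return hashlib.sha256(data).hexdigest()


def build_embedded(body: str, timestamp: str) -> dict:
    """Variant A: the build timestamp is written into every output file."""
    page = f"<!-- generated {timestamp} -->\n{body}".encode()
    return {"paper.html": digest(page)}


def build_root_stamp(body: str, timestamp: str) -> dict:
    """Variant B: files carry no timestamp; one root manifest records it."""
    files = {"paper.html": digest(body.encode())}
    manifest = json.dumps({"built": timestamp, "files": files}, sort_keys=True)
    return {**files, "manifest.json": digest(manifest.encode())}


# Two reruns a week apart; the converted content itself is identical.
body = "<html><body>Converted paper</body></html>"
a1 = build_embedded(body, "2015-09-01")
a2 = build_embedded(body, "2015-09-08")
b1 = build_root_stamp(body, "2015-09-01")
b2 = build_root_stamp(body, "2015-09-08")

# Variant A: the file hash changes on every rerun, so mirrors must refetch it.
embedded_stable = a1["paper.html"] == a2["paper.html"]
# Variant B: the file hash is stable; only the small manifest changes.
root_file_stable = b1["paper.html"] == b2["paper.html"]
root_manifest_stable = b1["manifest.json"] == b2["manifest.json"]

print(embedded_stable, root_file_stable, root_manifest_stable)
```

In variant B a mirror would only need to fetch the new manifest after a rerun that changed nothing else, which is the practical benefit of a deterministic build.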

> As to your TeX.js project, adding a bit more information about the purpose and scope of the technical side of the project would be helpful to understand exactly what you are building there.

TeX.js doesn't do anything particularly novel (yet), it's mainly about combining a bunch of existing libraries into a single easy-to-use package. It's still in the early stages, but I will definitely add more description once it stabilises.

> I am assuming it currently fine-tunes the appearance of TeX-written converted documents in HTML?

Yes, as well as documents written directly in (plain/unstyled) HTML (or converted from another markup format). The goal is to ensure (roughly) the same level of typographical quality as PDFs produced by (pdf)TeX.

> I have seen you have LaTeXML.css in the repository, which is a great call, Bruce Miller has done a fantastic job with mapping out the styling hooks of LaTeX-based macros to the final HTML classes.

Yeah, the output produced by LaTeXML is really great for custom styling. The LaTeXML.css file in the repository currently only modifies the style of (foot)notes (in addition to the non-LaTeXML-specific styling in main.css), but I'd like there to be tighter integration where appropriate in the future.

> Could we think of IPFS as a cloud backup/mirror service for our arXMLiv data?

Yes, definitely. A distributed CDN also.

PS: @kohlhase @dginev You're welcome to join us at freenode#ipfs (or http://chat.ipfs.io ) to discuss further :)

jbenet commented 8 years ago

> The first bit of information to bring to this conversation is that while arXiv is OpenAccess for readers, it actually has a restrictive redistribution clause. But now I read your #2 issue and I see you have limited the mirror to the CC-licensed papers, nice. Ok.

We would love to get Arxiv.org itself set up on IPFS -- so that it is distributing through IPFS. This is very much a new model of doing the web, so it's going to require some careful thinking + stepping. Though note this would be a huge step forward for open access journals, as it would allow people to create replicas and ensure the survival of the material, should terrible things happen. We'll be reaching out to people at Arxiv as time goes on. If you know people we should talk to, please let us know.


(As a note of urgency, it's not actually at all guaranteed that we won't have terrible disasters this half-century that may wipe out major information caches -- we're trying to prepare for the worst. -- by the way, i'm appalled that there isn't a real "Plan to Backup All Human Knowledge" that we are actively undertaking. There's suggestions, but nothing concrete yet... (we can't even agree on long term storage media...) Anyway, Arxiv papers -- i think -- are a critical resource to back up first)

RichardLitt commented 8 years ago

> As a note of urgency, it's not actually at all guaranteed that we won't have terrible disasters this century that may wipe out major information caches -- we're trying to prepare for the worst. -- by the way, i'm appalled that there isn't a real "Plan to Backup All Human Knowledge" that we are actively undertaking. There's suggestions, but nothing concrete yet... (we can't even agree on long term storage media) Anyway, Arxiv papers -- i think -- are a critical resource to back up first

backupeverything.io is free. Let's buy it and point it to a discussion repository.

davidar commented 8 years ago

@dginev By the way, http://cortex.mathweb.org/ is giving me an internal server error :/

dginev commented 8 years ago

@davidar that's OK, it wasn't supposed to be working yet. Check it again a week from now.

jbenet commented 8 years ago

seems to be done :)

davidar commented 8 years ago

Yay :)

@dginev Is there a rough ETA on how long the conversion will take?

dginev commented 8 years ago

The first conversion took just under 7 days for 992k documents from arXiv. The current state is already a second stability run that addresses various problems with the peripheral machines (e.g. our large compute cluster is missing an image-processing binding, etc.). So we're not running at full capacity at all right now.

Hopefully we'll have a stable state a couple of weeks from now (I am working on this part-time, mostly on weekends, so it's a bit spread out in time).

davidar commented 8 years ago

@dginev Cool, ping me when you're ready :)

davidar commented 8 years ago

@dginev Just thought I'd check in on how CorTeX is going?

Let us know if there's anything we can do to help.

CC: @mekarpeles

dginev commented 8 years ago

Thanks for asking! I just started a partial rerun, took forever to stabilize our software setup at the compute cluster at Jacobs University. There is still stability work that will need doing, but hopefully it goes much smoother. You can see the current rerun here: http://cortex.mathweb.org/corpus/arXMLiv/tex_to_html

If you want to track the project, you can subscribe to our arXMLiv mailing list where I send announcements: http://lists.jacobs-university.de/mailman/listinfo/project-arxiv-xml

There is a list archive, but you also need to be a member to read it, it seems.

Also, thanks for the offer to help. The real help needed for that project is improving LaTeXML's coverage of arXiv; CorTeX is already quite usable and is gradually getting more stable.

dginev commented 8 years ago

@davidar could you share with me a list of arXiv IDs that are licensed under some version of Creative Commons?

We are currently on hold on an email thread with the arXiv.org team, discussing the redistribution rights we have for our HTML-converted versions of their documents. However, for CC docs the discussion doesn't even need to take place, as they clearly allow redistribution of derivative works. So if you have a compiled list of IDs for those works, I'd be happy to start working on bundling a public dataset with what we have now. There are just over 950,000 files that have an HTML equivalent at the moment, although it is an ongoing project to ensure the quality of said files.

But it's a start, and the resulting corpus will benefit not only IPFS but the wider scientific community interested in NLP over scientific documents.

davidar commented 8 years ago

@dginev Sure, I made a list here. All the punctuation in identifiers has been replaced with slashes, but it shouldn't be too hard to convert them back.
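
Converting the identifiers back could look roughly like the sketch below. It is based on assumptions, since the exact format of the list is not shown here: new-style IDs such as `1501.00001` are assumed to appear as `1501/00001`, and old-style IDs such as `hep-th/9901001` as `hep/th/9901001`, with hyphens as the only punctuation lost inside archive names. `restore_arxiv_id` is a hypothetical helper, not part of any arXiv tooling.

```python
import re

# New-style arXiv IDs are YYMM.NNNN or YYMM.NNNNN, optionally followed by vN.
# In the slashed form the dot is assumed to have become a slash.
NEW_STYLE = re.compile(r"^(\d{4})/(\d{4,5})(v\d+)?$")


def restore_arxiv_id(slashed: str) -> str:
    """Best-effort inverse of 'all punctuation replaced with slashes'."""
    match = NEW_STYLE.match(slashed)
    if match:
        yymm, number, version = match.groups()
        return f"{yymm}.{number}{version or ''}"

    # Old-style IDs: archive/YYMMNNN, where the archive name may have
    # contained hyphens (assumption: e.g. 'hep/th' was 'hep-th').
    *archive_parts, number = slashed.split("/")
    if not archive_parts:
        raise ValueError(f"not a recognised slashed arXiv ID: {slashed!r}")
    return "-".join(archive_parts) + "/" + number


examples = ["1501/00001", "0704/0001", "hep/th/9901001", "math/0309136"]
restored = [restore_arxiv_id(item) for item in examples]
print(restored)
```

Old-style IDs that originally included a subject class with a dot (such as `math.GT/...`) cannot be told apart from hyphenated archive names in this scheme, so those would need to be checked against arXiv's metadata.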

By the way, I asked @mekarpeles to invite you to ArchiveLabs/OpenJournal, but wasn't sure if you'd received it. We're trying to get together people and organisations broadly interested in open access (from data processing and distribution through to frontend interfaces) to coordinate and share experience and resources. I think your experience with projects like arXMLiv and CorTeX would be of interest to a number of people :)

dginev commented 8 years ago

Oh wow, only 17426? That's about 1.7% of all of arXiv, quite humble. Thanks for the list!

As to an invitation, I don't remember receiving one, but it sounds like a worthy initiative.

davidar commented 8 years ago

Yeah, unfortunately the vast majority of papers just have the default arXiv license.

dginev commented 5 years ago

Hi all. I randomly remembered this issue existed and wanted to check if there is something more we can do to assist the IPFS archives effort.

The CorTeX system has gotten a bit better, and we are still actively converting arXiv to HTML5 (and are now publishing annual datasets). The old links should still resolve with redirects, but the main arXiv conversion site is now at: https://corpora.mathweb.org/corpus/arxmliv/tex_to_html

If the issue is still relevant, I would need an updated list of CC-licensed articles and we can discuss how to manage a file transfer.