web mentions for datasets with many URLs

agreiner commented 8 years ago

Posting from AC review comment, as requested by Sandro: Section 4.2 posits that "the receiver can fetch only the first 1mb of the page, since any reasonable HTML or JSON page will be smaller than that." While I agree that an HTML page larger than 1 MB is extreme, JSON data files can reasonably be larger. The suggestion of using a landing page for non-HTML content is helpful here, but even that effectively would seem to set an upper bound on how many URLs can be mentioned in a dataset. I realize it's late in the game to address this use case, but it would be nice to at least consider it. One workaround might be to break up long lists of mentioned URLs into multiple pages.
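
To make the concern concrete, here is a minimal Python sketch of a receiver that reads only the first 1 MB of a source document before checking for the target URL. The names `read_capped` and `mentions_target` and the `MAX_BYTES` constant are illustrative assumptions, not anything defined by the spec, and a plain substring check stands in for real HTML or JSON link extraction.

```python
import io

MAX_BYTES = 1_000_000  # assumed cap, following the "first 1mb" wording quoted above


def read_capped(stream, limit=MAX_BYTES):
    """Read at most `limit` bytes from a binary file-like object."""
    return stream.read(limit)


def mentions_target(stream, target, limit=MAX_BYTES):
    """Return True if `target` appears within the first `limit` bytes.

    Simplification: a substring check stands in for real link extraction.
    """
    body = read_capped(stream, limit)
    return target.encode("utf-8") in body


# A mention placed past the cap is never seen by the receiver.
link = b'<a href="http://example.com/post">x</a>'
padding = b" " * MAX_BYTES
print(mentions_target(io.BytesIO(padding + link), "http://example.com/post"))  # False
print(mentions_target(io.BytesIO(link + padding), "http://example.com/post"))  # True
```

Under this assumption, any URL listed after the first 1 MB of a large dataset file would fail verification, which is the upper bound described above.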

sandhawke commented 7 years ago

@aaronpk Any thoughts on this? How might that breaking-up work? Like:

http://example.org/bigDataSet fetched as JSON is 200MB, but fetched as HTML it redirects to a 1MB first landing page at http://example.org/aboutData/p1, which has a rel=next link to /p2, etc., for hundreds of pages, and then... no, that doesn't work. One could do mentions from the pages, but that leaves you with only links to the /p1 ... /p200 pages.

Basically, I think for this kind of thing you'd want to view bigDataSet as a database which can be queried, instead of as a flat file. But maybe there's some clever approach that would work....

kevinmarks commented 7 years ago

Fragments, or indeed fragmentions, are one answer. We have implementations of fragmentions as targets that show by the paragraph.

(Sent by email in reply to Sandro Hawke's comment of Wed, 7 Dec 2016, 19:24.)

sandhawke commented 7 years ago

So, if you have a big dataset file, and you want to use webmention, one technique is to also publish it as many small files (each under 1MB), each of which is webmentioned separately? And they can link rel=something to the main dataset. Is that it?
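
As a rough sketch of that technique in Python: the code below greedily groups the mentioned URLs so that each rendered HTML page stays under a byte cap, and gives each page a link back to the main dataset. The function names, the page markup, and the `MAX_PAGE_BYTES` value are assumptions for illustration. The `rel` value is left as a parameter because this thread does not settle on one.

```python
from html import escape

MAX_PAGE_BYTES = 1_000_000  # assumed per-page cap


def item_html(url):
    """One list item linking to a mentioned URL."""
    return f'<li><a href="{escape(url)}">{escape(url)}</a></li>\n'


def render_page(urls, dataset_url, rel):
    """Render one small HTML page that links to `urls` and back to the dataset."""
    head = (
        "<!DOCTYPE html>\n"
        f'<html><head><link rel="{escape(rel)}" href="{escape(dataset_url)}"></head>\n'
        "<body><ul>\n"
    )
    return head + "".join(item_html(u) for u in urls) + "</ul></body></html>\n"


def split_into_pages(urls, dataset_url, rel, max_bytes=MAX_PAGE_BYTES):
    """Greedily group URLs so each rendered page is at most `max_bytes` bytes.

    A single URL whose item alone exceeds the cap still gets its own page.
    """
    overhead = len(render_page([], dataset_url, rel).encode("utf-8"))
    pages, current, size = [], [], overhead
    for url in urls:
        item_size = len(item_html(url).encode("utf-8"))
        if current and size + item_size > max_bytes:
            pages.append(current)
            current, size = [], overhead
        current.append(url)
        size += item_size
    if current:
        pages.append(current)
    return pages
```

Each returned group could then be published at its own URL (for example /p1, /p2, ...) and used as the source of the Webmentions for the URLs it lists.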

aaronpk commented 7 years ago

@sandhawke that sounds like good practice in general anyway. The first thing that websites do when dealing with large datasets is break them up into "pages" and offer a paging mechanism in the UI.
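
For illustration, a minimal Python sketch of what sending from the pages could look like: one Webmention request per (page, target) pair, with the page URL as `source`. It only builds the form-encoded POST requests and does not send them. Endpoint discovery is not shown, so `endpoint_for` is a hypothetical callable supplied by the caller, and the helper names are assumptions.

```python
from urllib.parse import urlencode
from urllib.request import Request


def build_webmention(endpoint, source, target):
    """Build (but do not send) a form-encoded Webmention POST request."""
    data = urlencode({"source": source, "target": target}).encode("ascii")
    return Request(
        endpoint,
        data=data,
        method="POST",
        headers={"Content-Type": "application/x-www-form-urlencoded"},
    )


def requests_for_pages(pages, endpoint_for):
    """Build one request per (page, target) pair.

    pages: dict mapping a page URL to the list of target URLs it links to.
    endpoint_for: hypothetical callable returning the Webmention endpoint
        already discovered for a target (discovery itself is not shown).
    """
    requests = []
    for page_url, targets in pages.items():
        for target in targets:
            requests.append(build_webmention(endpoint_for(target), page_url, target))
    return requests


pages = {
    "http://example.org/aboutData/p1": ["http://example.com/a"],
    "http://example.org/aboutData/p2": ["http://example.com/b"],
}
reqs = requests_for_pages(pages, lambda target: "http://example.com/webmention")
print(len(reqs))  # 2
```

Each receiver then only has to fetch the one small page named as `source`, rather than the whole dataset.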

aaronpk commented 7 years ago

Please review the changes in #86 for some wording suggesting breaking up large datasets into pages when sending Webmentions.

sandhawke commented 7 years ago

I like it. @agreiner does that work for you? See: https://github.com/w3c/webmention/pull/86/files

w3c / webmention

web mentions for datasets with many URLs #83