Closed: rockdaboot closed 5 years ago
I suggest splitting WARC code into a libWARC. This allows us to enable / disable WARC support without a lot of work in the codebase.
The first step would be a second library within the Wget2 project, simply because we do not have the manpower to start another project. I know how much work libpsl was (and still is), though it is just a dictionary lookup.
Something for GSoC?
I agree. Libpsl has been more work than you'd have guessed.
Still, small library-style code for WARC within Wget2 would be a good idea. It can later be split into a separate library.
For GSoC, I'm not so sure, since most of the WARC code is already available in Wget. The effort required to port it to libwget and clean up the API isn't enough to warrant 2 months of full-time work.
For GSoC, a stand-alone WARC library (including the tools / example code) plus Wget2 integration is more than enough. The main work is always in the details, not to forget the documentation.
And it could be pretty attractive... creating a shared library from scratch with auto* tools, portability, gnulib, understanding WARC, reading + modifying existing code, API design, integration into existing software, Travis CI (or Gitlab CI), Doxygen, Unit testing, ...
This is a lot of work, and whoever masters it can claim experience with attractive technologies. If I were a student, I would go for it.
Oh, alright. I thought you implied that only splitting the WARC code within Wget2 would be a GSoC project. That would not require a lot of work since the WARC code already exists in Wget 1.x and the student simply needs to port it to Wget2.
However, if the project is splitting it into a stand-alone library, plus sample integration inside Wget2, then it looks more like a full project. Though, I'm still not sure if it warrants 2 months of programming. The proposal submitted by students would require them to think about the API design and understanding WARC, so the actual GSoC term would include splitting the code and trying to make sure it remains portable. It indeed is looking more like a GSoC project at this point. Maybe, once we polish the idea a bit more and write it down, it'll be easier to understand the time requirements of this.
What about a GSoC wiki page here (I guess there is still no wiki on Savannah)? Or is there a GNU-affiliated wiki that we can use?
Last year we used the GitHub wiki on my Wget repository. It still contains some of the FAQs and basic information for GSoC with Wget. We do need to update some parts of that.
We could either keep using it, or move the pages to the wiki here. I'm okay with either. Either way, writing the idea wiki pages would be nice.
Let's keep it there. We should clean up (move the 2015 stuff away) and add new pages.
So?
Hi! Has there been any progress on this? I've seen https://gitlab.com/gnuwget/wget2/-/issues/65#note_116131758, but that was 4 years ago.
Nothing much happened. The web archive team recently offered us their WARC extensions, but it is not straightforward to add them to Wget2. One thing is that all the authors have to assign copyright to the FSF before we can accept non-trivial code from them. And this might be a blocker.
On a different note, the available time of the maintainers (incl. me) has shrunk. So we can mostly just "maintain" existing code instead of "developing" new stuff.
> The web archive team recently offered us their WARC extensions, but it is not straightforward to add them to Wget2.

What "extensions" are these?
Most of the code can be taken from Wget. It should be polished (made thread-safe and given a proper API for use by external applications).
Maybe this should go into a separate library / project, together with some new tools for creating, modifying, and querying WARC archives.
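To make the "thread-safe, proper API" point a bit more concrete, here is a rough sketch of what a record-writing entry point in such a library might look like. None of these names (`warc_record`, `warc_write_record`) exist in Wget or Wget2; they are purely hypothetical. It only emits the four mandatory WARC/1.0 header fields and assumes the caller supplies the record ID and timestamp. All state is passed in by the caller, so there are no globals to protect.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical record description; not an existing Wget/Wget2 type. */
typedef struct {
    const char *type;        /* WARC-Type, e.g. "resource" or "response" */
    const char *record_id;   /* WARC-Record-ID, e.g. "<urn:uuid:...>" */
    const char *date;        /* WARC-Date, UTC, e.g. "2016-01-01T00:00:00Z" */
    const void *payload;     /* record block, may be NULL if payload_len is 0 */
    size_t payload_len;      /* number of bytes in the record block */
} warc_record;

/*
 * Write one WARC/1.0 record to `out`.
 * The function keeps no global state: everything it needs is passed in,
 * so separate threads can write to separate streams without locking.
 * Returns 0 on success, -1 on invalid arguments or I/O error.
 */
int warc_write_record(FILE *out, const warc_record *rec)
{
    if (!out || !rec || !rec->type || !rec->record_id || !rec->date)
        return -1;
    if (rec->payload_len > 0 && !rec->payload)
        return -1;

    /* Version line, the four mandatory named fields, then an empty line. */
    if (fprintf(out,
                "WARC/1.0\r\n"
                "WARC-Type: %s\r\n"
                "WARC-Record-ID: %s\r\n"
                "WARC-Date: %s\r\n"
                "Content-Length: %zu\r\n"
                "\r\n",
                rec->type, rec->record_id, rec->date, rec->payload_len) < 0)
        return -1;

    /* Record block (the payload bytes), written verbatim. */
    if (rec->payload_len > 0 &&
        fwrite(rec->payload, 1, rec->payload_len, out) != rec->payload_len)
        return -1;

    /* Every record is terminated by two CRLF pairs. */
    if (fputs("\r\n\r\n", out) == EOF)
        return -1;

    return 0;
}
```

A real library would also need to generate record IDs, handle compression, and cover the remaining WARC fields; this is only meant to show the shape of a caller-owned, global-free API.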