rockdaboot / wget2

The successor of GNU Wget. Contributions preferred at https://gitlab.com/gnuwget/wget2. But accepted here as well 😍
GNU Lesser General Public License v3.0

Add proper WARC API + support by wget2 #65

Closed: rockdaboot closed this issue 5 years ago

rockdaboot commented 8 years ago

Most of the code can be taken from Wget. It should be polished (made thread-safe, given a proper API for use by external applications).

Maybe this should go into a separate library / project with some tools to be created (creation, modification, querying of WARC archives).

darnir commented 8 years ago

I suggest splitting WARC code into a libWARC. This allows us to enable / disable WARC support without a lot of work in the codebase.

rockdaboot commented 8 years ago

The first step would be a second library within the Wget2 project, simply because we do not have the manpower to start another project. I know how much work libpsl was (and still is), though it is just a dictionary lookup.

Something for GSoC?

darnir commented 8 years ago

I agree. libpsl has been more work than you'd have guessed.

Still, a small library-style module for WARC within Wget2 would be a good idea. It can later be split into a separate library.

For GSoC, I'm not so sure, since most of the WARC code is already available in Wget. The amount of effort required to port it to libwget and clean up the API isn't enough to warrant 2 months of full-time work.

rockdaboot commented 8 years ago

For GSoC, a stand-alone WARC library (including the tools / example code) plus Wget2 integration is more than enough. The main work is always in the details, not to forget the documentation.

And it could be pretty attractive: creating a shared library from scratch with auto* tools, portability, gnulib, understanding WARC, reading and modifying existing code, API design, integration into existing software, Travis CI (or GitLab CI), Doxygen, unit testing, ...

This is a lot of work, and whoever masters it can claim experience in attractive technologies. If I were a student, I would go for it.

darnir commented 8 years ago

Oh, alright. I thought you implied that only splitting the WARC code within Wget2 would be a GSoC project. That would not require a lot of work since the WARC code already exists in Wget 1.x and the student simply needs to port it to Wget2.

However, if the project is splitting it into a stand-alone library, plus sample integration inside Wget2, then it looks more like a full project. Though I'm still not sure it warrants 2 months of programming. The proposal submitted by students would require them to think about the API design and to understand WARC, so the actual GSoC term would cover splitting the code and making sure it remains portable. It is indeed looking more like a GSoC project at this point. Maybe once we polish the idea a bit more and write it down, it'll be easier to understand its time requirements.

rockdaboot commented 8 years ago

What about a GSoC wiki page here (I guess there is still no wiki on Savannah)? Or is there a GNU-friendly wiki that we can use?

darnir commented 8 years ago

Last year we used the GitHub wiki on my Wget repository. It still contains some of the FAQs and basic information for GSoC with Wget. We do need to update some parts of that.

We could either keep using it, or move the pages to the wiki here. I'm okay with either. Either way, writing the ideas wiki pages would be nice.

rockdaboot commented 8 years ago

Let's keep it there. We should clean up (move the 2015 stuff away) and add new pages.

uis246 commented 3 years ago

So?

cgr71ii commented 1 year ago

Hi! Has there been any progress on this? I've seen https://gitlab.com/gnuwget/wget2/-/issues/65#note_116131758, but that was 4 years ago.

rockdaboot commented 1 year ago

Not much has happened. The web archive team recently offered us their WARC extensions, but adding them to Wget2 is not straightforward. For one thing, all the authors have to assign copyright to the FSF before we can accept non-trivial code from them, and this might be a blocker.

On a different note, the maintainers' available time (incl. mine) has shrunk. So we can mostly just "maintain" existing code instead of "developing" new stuff.

Florents-Tselai commented 2 months ago

> The web archive team recently offered us their WARC extensions, but it is not straight forward to add into Wget2.

What "extensions" are these?