mahanama94 / link-parse

Python Package for link header parsing.

Parsing link formatted TimeMaps almost works with RegexLinkParser, but not quite #1

Open shawnmjones opened 3 years ago

shawnmjones commented 3 years ago

One of the largest problems we have is parsing link-formatted TimeMaps. The RegexLinkParser gets farther than any other implementation I've seen, besides the solution I implemented in AIU, and even AIU has issues with the occasional TimeMap.

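It parses a TimeMap into the appropriate relationships, but with some unexpected behaviors, visible in the output below: most of the parsed URIs carry trailing newline and space characters, and the license parameter does not appear in the results.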

Here's what I did with RegexLinkParser:

  1. start ipython
  2. import the RegexLinkParser
  3. copy the TimeMap from Figure 28 of RFC 7089
  4. paste the TimeMap into a variable named timemap
  5. parse it with RegexLinkParser
  6. print the results as shown in the Usage section of the link-parse README

Here's the code in ipython:

In [1]: from linkparse.regex_parser import RegexLinkParser

In [2]: timemap = """  <http://a.example.org>;rel="original",
   ...:     <http://arxiv.example.net/timemap/http://a.example.org>
   ...:       ; rel="self";type="application/link-format"
   ...:       ; from="Tue, 20 Jun 2000 18:02:59 GMT"
   ...:       ; until="Wed, 09 Apr 2008 20:30:51 GMT",
   ...:     <http://arxiv.example.net/timegate/http://a.example.org>
   ...:       ; rel="timegate",
   ...:     <http://arxiv.example.net/web/20000620180259/http://a.example.org>
   ...:       ; rel="first memento";datetime="Tue, 20 Jun 2000 18:02:59 GMT"
   ...:       ; license="http://creativecommons.org/publicdomain/zero/1.0/",
   ...:     <http://arxiv.example.net/web/20091027204954/http://a.example.org>
   ...:        ; rel="last memento";datetime="Tue, 27 Oct 2009 20:49:54 GMT"
   ...:        ; license="http://creativecommons.org/publicdomain/zero/1.0/",
   ...:     <http://arxiv.example.net/web/20000621011731/http://a.example.org>
   ...:       ; rel="memento";datetime="Wed, 21 Jun 2000 01:17:31 GMT"
   ...:       ; license="http://creativecommons.org/publicdomain/zero/1.0/",
   ...:     <http://arxiv.example.net/web/20000621044156/http://a.example.org>
   ...:       ; rel="memento";datetime="Wed, 21 Jun 2000 04:41:56 GMT"
   ...:       ; license="http://creativecommons.org/publicdomain/zero/1.0/",
   ...:       """

In [3]: parser = RegexLinkParser()

In [4]: parser_results = parser.parse(timemap)

In [5]: from pprint import pprint

In [6]: for result in parser_results:
   ...:     pprint(result.__dict__)
   ...:
{'datetime': '',
 'link_from': '',
 'link_type': '',
 'link_until': '',
 'relationship': 'original',
 'title': '',
 'uri': 'http://a.example.org'}
{'datetime': '',
 'link_from': 'Tue, 20 Jun 2000 18:02:59 GMT',
 'link_type': 'application/link-format',
 'link_until': 'Wed, 09 Apr 2008 20:30:51 GMT',
 'relationship': 'self',
 'title': '',
 'uri': 'http://arxiv.example.net/timemap/http://a.example.org\n      '}
{'datetime': '',
 'link_from': '',
 'link_type': '',
 'link_until': '',
 'relationship': 'timegate',
 'title': '',
 'uri': 'http://arxiv.example.net/timegate/http://a.example.org\n      '}
{'datetime': 'Tue, 20 Jun 2000 18:02:59 GMT',
 'link_from': '',
 'link_type': '',
 'link_until': '',
 'relationship': 'first memento',
 'title': '',
 'uri': 'http://arxiv.example.net/web/20000620180259/http://a.example.org\n'
        '      '}
{'datetime': 'Tue, 27 Oct 2009 20:49:54 GMT',
 'link_from': '',
 'link_type': '',
 'link_until': '',
 'relationship': 'last memento',
 'title': '',
 'uri': 'http://arxiv.example.net/web/20091027204954/http://a.example.org\n'
        '       '}
{'datetime': 'Wed, 21 Jun 2000 01:17:31 GMT',
 'link_from': '',
 'link_type': '',
 'link_until': '',
 'relationship': 'memento',
 'title': '',
 'uri': 'http://arxiv.example.net/web/20000621011731/http://a.example.org\n'
        '      '}
{'datetime': 'Wed, 21 Jun 2000 04:41:56 GMT',
 'link_from': '',
 'link_type': '',
 'link_until': '',
 'relationship': 'memento',
 'title': '',
 'uri': 'http://arxiv.example.net/web/20000621044156/http://a.example.org\n'
        '      '}

mahanama94 commented 2 years ago

Hi Shawn,

I somehow missed this issue. I have fixed the unexpected newline and space characters in the parsed URIs by stripping the string.
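
As a rough illustration of that kind of fix (not the actual change made in link-parse), the sketch below trims whitespace from every string field of a parsed link. The helper name `strip_fields` and the plain-dict representation are assumptions made for this example; the field names and the raw URI value are taken from the output earlier in the thread.

```python
def strip_fields(link: dict) -> dict:
    """Return a copy of a parsed link with whitespace trimmed from string values.

    Hypothetical helper for illustration; link-parse may apply the strip
    elsewhere, for example directly on the regex capture.
    """
    return {
        key: value.strip() if isinstance(value, str) else value
        for key, value in link.items()
    }


# Raw values as they appeared in the reported output.
raw = {
    "relationship": "timegate",
    "uri": "http://arxiv.example.net/timegate/http://a.example.org\n      ",
}

cleaned = strip_fields(raw)
print(repr(cleaned["uri"]))
```

Stripping after parsing keeps the regular expression unchanged; the alternative is to tighten the pattern so the whitespace is never captured.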

For the other link parameters, a separate implementation is needed for each algorithm.
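
One possible direction, sketched here under assumptions and independent of how link-parse is structured, is to collect every `name="value"` parameter into a dictionary instead of mapping a fixed set of fields. The patterns `LINK_RE` and `PARAM_RE` and the function `parse_links` are hypothetical names for this example. The sketch does not handle escaped quotes inside quoted values, and if a parameter is repeated it keeps only the first occurrence.

```python
import re

# A link-value: "<target>" followed by zero or more ";name=value" parameters.
# Quoted values may contain commas (as HTTP dates do), so they are matched as a unit.
LINK_RE = re.compile(
    r'<([^>]*)>((?:\s*;\s*[\w*-]+\s*=\s*(?:"[^"]*"|[^";,\s]+))*)'
)
PARAM_RE = re.compile(r';\s*([\w*-]+)\s*=\s*(?:"([^"]*)"|([^";,\s]+))')


def parse_links(text):
    """Parse link-format text into a list of {'uri': ..., 'params': {...}} dicts."""
    links = []
    for link_match in LINK_RE.finditer(text):
        params = {}
        for param_match in PARAM_RE.finditer(link_match.group(2)):
            name = param_match.group(1).lower()
            quoted, token = param_match.group(2), param_match.group(3)
            # Keep the first occurrence of a repeated parameter.
            params.setdefault(name, quoted if quoted is not None else token)
        links.append({"uri": link_match.group(1).strip(), "params": params})
    return links


sample = '''<http://a.example.org>;rel="original",
    <http://arxiv.example.net/web/20000620180259/http://a.example.org>
      ; rel="first memento";datetime="Tue, 20 Jun 2000 18:02:59 GMT"
      ; license="http://creativecommons.org/publicdomain/zero/1.0/",
'''

links = parse_links(sample)
for link in links:
    print(link["uri"], link["params"])
```

With this shape, parameters such as `license` survive parsing without a dedicated field, and callers can still read `rel`, `datetime`, `from`, and `until` from the same dictionary.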

I'm keeping the issue open until I've completed this for each algorithm.