jvanasco / metadata_parser

python library for getting metadata

Creeping featurism and adding one or two more `page` fields ... #30

Closed: mzpqnxow closed this issue 11 months ago

mzpqnxow commented 4 years ago

Hello, first, thank you for your work on this; I was very happy to find it. Unfortunately for me, I went through the pain of writing something very similar two days before stumbling upon it. It was probably much like your implementation: a mishmash of bs4 and regex that tries to intelligently handle malformed/non-compliant HTML while also just taking whatever bs4 can cleanly pick up on the first try. I trust your implementation far more than my hastily written one, so I have switched over to it; it's working great and chopped a lot of cruft out of my project.

However... there are a few (minor) things I was grabbing from the page, in addition to the title, that you chose not to grab. For my use case (fingerprinting specific technologies based on HTTP protocol info and HTML content), grabbing h1, h2, and h3 came in handy in a bunch of cases. Right now I use metadata_parser to get the title and the meta tag data, but I still have a bit of code doing the bs4 and re.search logic for those hn tags. Any thoughts one way or another on adding capture of these? I'd be happy to PR it if you would accept it, as I would love to dump the last few remnants of my own code from my project.
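To be concrete, the hn grabbing I'm talking about is nothing more elaborate than something along these lines (a rough sketch, not my exact code; the tag list and whitespace handling are just illustrative):

```python
from bs4 import BeautifulSoup

def grab_heading_tags(html, tags=("h1", "h2", "h3")):
    """Collect the text of the h1/h2/h3 tags from a page (rough sketch)."""
    soup = BeautifulSoup(html, "html.parser")
    headings = {}
    for tag in tags:
        # strip whitespace and drop empty headings
        values = [el.get_text(strip=True) for el in soup.find_all(tag)]
        headings[tag] = [v for v in values if v]
    return headings
```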

If you think this goes a bit too far beyond the scope of what you have in mind, no worries. I only even thought to ask because you pull the `<title>` tag content.

Something I find nice about your interface is that you thought to include clean, direct access to the bs4 soup object via `MetadataParser::soup`, so either way I can at least avoid the cost of parsing the page multiple times with bs4.

Thanks again, appreciate your work on this

Ah, one last note- I went through the trouble of adding a function to parse the http-equiv refresh content value- not just to split out the seconds value and the URL or relative path value, but also, in the case where it's a URL, crack the URL into the urlparse() components. This is probably a bit much for most people but was important for my project. I really doubt you would be interested in adding this to your project (either yourself or as a PR) but happy to hear your thoughts either way. I was breaking it into something like this:

          "meta_refresh": {
            "refresh": "2;URL=https://www.somesite.co.za:2087/login"
          },
          "meta_refresh_cracked": {
            "path": "/login",
            "port": 2087,
            "username": null,
            "password": null,
            "params": "",
            "fragment": "",
            "refresh_seconds": 2,
            "scheme": "https",
            "private suffix": "somesite.co.za"
            "hostname": "www.somesite.co.za",
            "raw_url": "https://www.somesite.co.za:2087/login"
          }

For instances where it's a relative path instead of a full URL, I give the relative path a dummy scheme and hostname to let urlparse() cleanly break out the params from the path. I was partly too lazy to manually split the path and query params, and partly worried I might mishandle some bizarre edge case that I trusted urlparse() to get right. So I break it down like this:

          "meta_refresh": {
            "refresh": "2;url=/account?param1=xyz",
            "Robots": "noindex, nofollow"
          },
          "meta_refresh_cracked": {
            "path": "/account",
            "port": null,
            "username": null,
            "password": null,
            "params": "param1=xyz",
            "fragment": null,
            "refresh_seconds": 2,
            "scheme": null,
            "hostname": null,
            "raw_url": "/account?param1=x"
          }

This is probably not worth much, but for some reason I felt compelled to "finish" the job- once I start cracking a field... I can get carried away and crack it as far as it can go. At least I didn't break the query parameters into a dict :>
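For what it's worth, the cracking itself is just string splitting plus urlparse(); a simplified sketch of the idea (not my exact code, and the key names and dummy host are only illustrative) looks roughly like this:

```python
from urllib.parse import urlparse

def crack_meta_refresh(content):
    """Split a refresh value like '2;URL=https://host:2087/login' or
    '2;url=/account?param1=xyz' into seconds plus URL components (sketch)."""
    seconds_part, _, url_part = content.partition(";")
    refresh_seconds = int(seconds_part.strip())
    raw_url = url_part.strip()
    # the 'URL=' / 'url=' prefix is optional and case-insensitive
    if raw_url.lower().startswith("url="):
        raw_url = raw_url[4:].strip()
    is_relative = "://" not in raw_url
    # relative paths get a dummy scheme/host so urlparse() still splits
    # the path and query cleanly
    target = ("http://dummy.invalid" + raw_url) if is_relative else raw_url
    parts = urlparse(target)
    return {
        "refresh_seconds": refresh_seconds,
        "raw_url": raw_url,
        "scheme": None if is_relative else parts.scheme,
        "hostname": None if is_relative else parts.hostname,
        "port": None if is_relative else parts.port,
        "path": parts.path,
        "params": parts.query,  # called "params" in my output, but it is the query string
        "fragment": parts.fragment or None,
    }
```

(The "private suffix" field in the first example would come from a separate public-suffix lookup, which isn't shown here.)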

jvanasco commented 4 years ago

Unfortunately, I am way too busy over the next week to address this stuff in depth, but I can give high-level feedback.

  1. The refresh stuff seems somewhere between fine and too much. I need to look at it against some datasets. I do like the idea of parsing some of the info, such as...

    "meta_refresh": { "refresh": "2;url=/account?param1=xyz", },

    _meta_refresh_parsed: { "time": 2, "url": "/account?param1=xyz" }

I'm not sure about the further parsing, how this should be supported by default, and how to opt in or out. For our use case, I wrote this package for a content search engine and we archive this information; a handful of "top users" of this library are in the same situation, and adding an extra payload of redundant information throws off data storage calculations.

  2. The h2/h3/etc stuff seems okay, but I worry about things getting too carried away and how this may affect existing installs. It might make sense to use callback hooks or even a subclass model to do the extra work. We did the latter in a PR a year or so ago. This way you could write a class like...

    class ExtendedMetadataParser(MetadataParser):
        def post_processing_hook(self, *args, **kwargs):
            # ... custom code here ...

and then .parse would be extended to do...

class MetadataParser(object):
    post_processing_hook = None
    def parse(self, *args, **kwargs):
        # ... current code here ...
        if self.post_processing_hook is not None:
            self.post_processing_hook(*args, **kwargs)

Some of this could be built in too, and handled with subclass controls. There was a lot of feature creep a few years back, and the number of args/kwargs exploded. This has really pushed me towards looking more at subclasses and callbacks, so the library can support more use cases... but still remain usable.
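To make the hook idea concrete with the h1/h2/h3 request above, a subclass could do something like this (an untested sketch; it assumes the hook takes no special arguments and that the soup object is reachable as `self.soup`, per the soup access mentioned earlier in the thread):

```python
class HeadingMetadataParser(MetadataParser):
    """Hypothetical subclass using the post_processing_hook idea above."""

    def post_processing_hook(self, *args, **kwargs):
        headings = {}
        for tag in ("h1", "h2", "h3"):
            headings[tag] = [
                el.get_text(strip=True) for el in self.soup.find_all(tag)
            ]
        # stash the extra data somewhere the caller can reach it
        self.headings = headings
```

parse() would then invoke the hook after its normal work, so callers would just read `parser.headings` afterwards.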

mzpqnxow commented 4 years ago

> Unfortunately, I am way too busy over the next week to address this stuff in depth, but I can give high-level feedback.

Understood- thanks for taking the time to answer this

> 1. The refresh stuff seems somewhere between fine and too much. I need to look at it against some datasets. I do like the idea of parsing some of the info, such as... `"meta_refresh": { "refresh": "2;url=/account?param1=xyz", }`, `_meta_refresh_parsed: { "time": 2, "url": "/account?param1=xyz" }`

> I'm not sure about the further parsing

Yeah, I agree with you here. Despite implementing that extra level of cracking in my own code, I don't think that level of detail is appropriate within metadata_parser. I only really consider it reasonable to split the content into the time and the URL/path component; there's minimal value in the extra granularity for nearly everyone, and it definitely wanders out of the project's scope in my opinion, going from parsing HTML to deeply interpreting it and assigning meaning to it. It doesn't seem appropriate as core metadata_parser functionality; users can crack the value further themselves after metadata_parser hands it to them.

> 2. The h2/h3/etc stuff seems okay, but I worry about things getting too carried away and how this may affect existing installs. It might make sense to use callback hooks or even a subclass model to do the extra work

I agree; I definitely do not want to go beyond h1, h2, h3, especially within the metadata_parser code itself. For anything beyond that, the extensibility approach seems like the right way to go. It would allow much less invasive changes while still giving users the flexibility to do this themselves, either via their own adapters/callbacks and/or via some shipped as optional parts of the package. Maybe one common adapter/subclass (mainly to serve as an example for users) would be appropriate to include if this were implemented.

> There was a lot of feature creep a few years back, and the number of args/kwargs exploded. This has really pushed me towards looking more at subclasses and callbacks, so the library can support more use cases... but still remain usable.

I noticed that some of the function kwargs had started to look like pandas, whose most ubiquitous and powerful classes commonly take 10+ kwargs. I bet pylint's default settings love that kwargs list :)

Correct me if I am wrong (if you have time), but this is what I am ultimately taking away from your comments, which were very helpful:

  1. I don't see any areas of philosophical or implementation disagreement, aside from the extensive cracking of the meta refresh that I cited in my first comment, and I happen to actually agree with you there. I think metadata_parser can serve to break the content into the time value and the URL/relative path, but anything beyond that belongs in user code or in a special (optional) adapter

  2. h1/h2/h3 are suitable candidates for being bolted onto the code, but going anywhere beyond that with HTML body tags is too slippery a slope. I fully agree here; I actually concluded these were the only tags with universal appeal across my dataset. I didn't choose them arbitrarily, nor were they just a starting point for a laundry list of additional tags, so we're on the same page. Otherwise, at some point you might as well just return the entire bs4 soup object :P

  3. In general, you want to move towards providing a generic, cleaner way for functionality to be extended, via classes/adapters and/or callbacks/hooks, some of which may be included in the package for user convenience or to serve as examples. This provides flexibility, makes the project easier to maintain in the long term, and doesn't require adding more and more kwargs

  4. I don't know that you said this directly, but if there are opportunities to retroactively apply this design where it would improve maintainability, they're worth looking at. One example I can think of is replacing one or two of the kwargs with a class that acts as an adapter to influence the standard behavior (a rough sketch of what I mean is below). Though I would guess you would prefer to bump the package's major version for changes like these, rather than incrementing the minor version within the current major, even if they are mostly abstracted from the user
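To illustrate the kind of adapter I mean in point 4 (purely hypothetical; none of these names exist in the library today), the idea is just to bundle related kwargs into one object the parser consults:

```python
class RefreshOptions:
    """Hypothetical adapter bundling what are currently separate kwargs."""

    def __init__(self, parse_refresh=True, split_url=False):
        self.parse_refresh = parse_refresh
        self.split_url = split_url

# Hypothetical usage: one options object instead of several kwargs, e.g.
#   parser = MetadataParser(url=some_url, refresh_options=RefreshOptions(split_url=True))
```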

It doesn't matter to me whether you leave this issue open or not; it served its purpose as a communication medium and helped me quite a bit.

I will probably fork master in the short term and at some point send a PR for you to glance at, at the very least for the h1/h2/h3 stuff, which is quick and easy. If you don't want to merge it at that time (or ever), it won't be a big problem for me; I can just pip install from my fork+branch. Or we can work out modifications. The more overarching changes will be a separate effort

Thanks again for the insight and the brief backstory on that initial expansion of the features. I can relate to being short on time and I realize this is not a sponsored project for you, so taking the time to write this note is much appreciated

mzpqnxow commented 11 months ago

3 years later we can probably lay it to rest 😃

Thanks again for the discussion!