Unfortunately, I am way too busy over the next week to address this stuff in depth, but I can give high-level feedback.
The refresh stuff seems somewhere between fine and too much. I need to look at it against some datasets. I do like the idea of parsing some of the info, such as...
"meta_refresh": { "refresh": "2;url=/account?param1=xyz", },
_meta_refresh_parsed: { "time": 2, "url": /account?param1=xyz", }
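For illustration, that split can be done with something as simple as the sketch below (not library code; it ignores edge cases like quoted URLs or a missing `url=` prefix):

```python
# a rough sketch of the split under discussion, not the library's code
def parse_meta_refresh(value):
    time_part, _, rest = value.partition(";")
    rest = rest.strip()
    url = rest[4:] if rest.lower().startswith("url=") else None
    return {"time": int(time_part.strip()), "url": url}

print(parse_meta_refresh("2;url=/account?param1=xyz"))
# {'time': 2, 'url': '/account?param1=xyz'}
```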
I'm not sure about the further parsing, how it should be supported by default, and how to opt in or out. In our use case, I wrote this package for a content search engine and we archive this information; a handful of "top users" of this library are in the same situation. Adding an extra payload of redundant information throws off data storage calculations.
The h2/h3/etc stuff seems okay, but I worry about things getting too carried away and how this may affect existing installs. It might make sense to use callback hooks or even a subclass model to do the extra work. We did the latter in a PR a year or so ago. That way you could write a class like...
```python
class ExtendedMetadataParser(MetadataParser):
    def post_processing_hook(self, *args, **kwargs):
        # ... custom code here ...
        ...
```
and then `.parse` would be extended to do...
```python
class MetadataParser(object):
    post_processing_hook = None

    def parse(self, *args, **kwargs):
        # ... current code here ...
        if self.post_processing_hook is not None:
            self.post_processing_hook(*args, **kwargs)
```
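As a self-contained sketch of how that hook pattern behaves (the `parse()` body here is a stand-in, not the real implementation):

```python
class MetadataParser(object):
    post_processing_hook = None

    def parse(self, *args, **kwargs):
        metadata = {"title": "example"}  # stand-in for the real parsing work
        if self.post_processing_hook is not None:
            self.post_processing_hook(metadata)
        return metadata


class ExtendedMetadataParser(MetadataParser):
    def post_processing_hook(self, metadata):
        # custom code here: derive an extra field from the parsed data
        metadata["title_upper"] = metadata["title"].upper()


print(ExtendedMetadataParser().parse())
# {'title': 'example', 'title_upper': 'EXAMPLE'}
```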
Some of this could be built in too, and handled with subclass controls. There was a lot of feature creep a few years back, and the number of args/kwargs exploded. This has really pushed me towards looking more at subclasses and callbacks, so the library can support more use cases... but still remain usable.
> Unfortunately, I am way too busy over the next week to address this stuff in depth, but I can give high-level feedback.
Understood, thanks for taking the time to answer this.
> The refresh stuff seems somewhere between fine and too much. I need to look at it against some datasets. I do like the idea of parsing some of the info, such as... `"meta_refresh": {"refresh": "2;url=/account?param1=xyz"}`, `"_meta_refresh_parsed": {"time": 2, "url": "/account?param1=xyz"}`
>
> I'm not sure about the further parsing
Yeah, I agree with you here. Despite implementing that extra level of cracking in my own code, I don't think that level of detail is appropriate within metadata_parser. I only really consider it reasonable to see the content split into the time and URL/path components; there's minimal value in the extra granularity for nearly everyone, and in my opinion it definitely wanders out of the project's scope, going from parsing HTML to deeply interpreting it. It doesn't seem appropriate for core metadata_parser functionality; users can crack the value further themselves after using metadata_parser to get it, if they want to.
> The h2/h3/etc stuff seems okay, but I worry about things getting too carried away and how this may affect existing installs. It might make sense to use callback hooks or even a subclass model to do extra work
I agree, I definitely do not want to go beyond `h1`, `h2`, `h3`, especially within metadata_parser code. For anything beyond that, the extensibility approach seems like the right way to go. It would allow much less invasive changes while giving users the flexibility to easily do this themselves, either via their own adapters/callbacks and/or via some shipped as optional parts of the package. Maybe one common adapter/subclass (mainly to function as an example for users) would be appropriate for inclusion if this were implemented.
> There was a lot of feature creep a few years back, and the number of args/kwargs exploded. This has really pushed me towards looking more at subclasses and callbacks, so the library can support more use cases... but still remain usable.
I noticed that some of the function `kwargs` had started to look like pandas, which commonly has 10+ `kwargs` on its most ubiquitous and powerful classes. I bet pylint's default settings love that `kwargs` list :)
Correct me if I am wrong (if you have time), but here is what I am ultimately taking away from your comments, which were very helpful:
- I don't see any areas of philosophical or implementation disagreement, aside from the extensive cracking of the meta refresh I cited in my first comment, and even there I happen to actually agree with you. I think metadata_parser can serve to break the content into the time value and the URL/relative path, but anything beyond that belongs in user code or in a special (optional) adapter.
- h1/h2/h3 are suitable candidates for being bolted onto the code, but going anywhere beyond that with HTML body tags is too slippery a slope. Fully agree here, and I actually concluded these were the only tags with universal appeal across my dataset. I didn't choose them arbitrarily, nor were they just a starting point for a laundry list of additional tags, so we're on the same page. Otherwise, at some point you might as well just return the entire `bs4` soup object :P
- In general, you want to move towards a generic, cleaner way for functionality to be extended, via classes/adapters and/or callbacks/hooks, some of which may be included in the package for user convenience or to serve as examples. This provides flexibility, makes the project easier to maintain in the long term, and doesn't require adding more and more `kwargs`.
- I don't know that you said this directly, but if there are opportunities to retroactively apply this design in areas where it can provide a maintainability improvement, they're worth looking at. One example I can think of is replacing one or two of the kwargs with a class that acts as an adapter to influence the standard behavior (see the sketch below). Though I would guess you would prefer to bump the package's major version for changes like these, rather than incrementing the minor version, even if they are mostly abstracted from the user.
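Purely a sketch of what I mean; the `UrlHandling` class and `url_handling` parameter are hypothetical, and the `MetadataParser` here is a simplified stand-in for the real class:

```python
# hypothetical: bundle what are currently separate kwargs into one adapter
class UrlHandling(object):
    def __init__(self, require_https=False, strip_fragments=True):
        self.require_https = require_https
        self.strip_fragments = strip_fragments


class MetadataParser(object):  # simplified stand-in for the real class
    def __init__(self, url=None, url_handling=None):
        self.url = url
        self.url_handling = url_handling or UrlHandling()


page = MetadataParser(
    url="https://example.com",
    url_handling=UrlHandling(require_https=True),
)
print(page.url_handling.require_https)  # True
```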
It doesn't matter to me whether you leave this issue open or not; it served its purpose as a communication medium and helped me quite a bit.
I will probably fork master in the short term and at some point send a PR for you to glance at and see what you think, at the very least for the h1/h2/h3 stuff, which is quick and easy. If you don't want to merge it at that time (or ever), it won't be a big problem for me; I can just `pip install` from my fork+branch. Or we can work out modifications. The more overarching changes will be a separate effort.
Thanks again for the insight and the brief backstory on that initial expansion of the features. I can relate to being short on time, and I realize this is not a sponsored project for you, so taking the time to write this note is much appreciated.
3 years later we can probably lay it to rest 😃
Thanks again for the discussion!
Hello, first thank you for your work on this, I was very happy to find it. Unfortunately for me, I went through the pain of writing something very similar 2 days before stumbling upon it. Probably similar to your implementation, it involved a mishmash of `bs4` and regex to try to intelligently handle malformed/non-compliant HTML, as well as just take whatever `bs4` could cleanly pick up on the first try. I trust your implementation much more because of how hastily I wrote mine; I have switched over to it, it's working great, and it chopped out a lot of cruft.

However... there are a few (minor) things I was grabbing from the page, in addition to the title, that you chose not to grab. For my use (fingerprinting specific technologies based on HTTP protocol info and HTML content) I found that grabbing `h1`, `h2` and `h3` came in handy in a bunch of my cases. Right now, I have switched over to `metadata_parser` to get the `title` and the `meta` tag data, but I still have a bit of code doing the `bs4` and `re.search` logic for those `hn` tags (roughly the kind of logic sketched below). Any thoughts one way or another on adding capture of these? I'd be happy to PR that if you would accept it, as I would love to dump the last few remnants of my code from my project.

If you think this goes a bit too far beyond the scope of what you have in mind, no worries. I only even thought to ask because you pull the `<title>` tag content.

Something I find nice about your interface is that you thought to include clean direct access to the bs4 soup object via `MetadataParser::soup`, so either way I can at least avoid the cost of parsing the page multiple times with `bs4`.
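For concreteness, an illustrative sketch of that kind of `bs4` logic (not my actual code):

```python
import re

from bs4 import BeautifulSoup

html = "<html><body><h1>Acme Panel</h1><h2>v2.1</h2></body></html>"
soup = BeautifulSoup(html, "html.parser")

# collect the text of each h1/h2/h3 tag, collapsing internal whitespace
headers = {
    name: [re.sub(r"\s+", " ", tag.get_text(strip=True)) for tag in soup.find_all(name)]
    for name in ("h1", "h2", "h3")
}
print(headers)  # {'h1': ['Acme Panel'], 'h2': ['v2.1'], 'h3': []}
```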
Thanks again, appreciate your work on this
Ah, one last note: I went through the trouble of adding a function to parse the `http-equiv` `refresh` `content` value, not just to split out the seconds value and the URL or relative-path value, but also, in the case where it's a URL, to crack the URL into its `urlparse()` components. This is probably a bit much for most people but was important for my project. I really doubt you would be interested in adding this to your project (either yourself or as a PR) but I'm happy to hear your thoughts either way. I was breaking it into something like this:
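(The exact field names below are illustrative; only the shape is the point.)

```python
# illustrative shape only; these field names are mine, not the library's
{
    "time": 2,
    "url": "https://example.com/account?param1=xyz",
    "url_parsed": {  # the six urlparse() components
        "scheme": "https",
        "netloc": "example.com",
        "path": "/account",
        "params": "",
        "query": "param1=xyz",
        "fragment": "",
    },
}
```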
For instances where it's a relative path instead of a full URL, I give that relative path a dummy scheme and hostname, to let `urlparse()` cleanly break out the `params` from the `path`. I was partially too lazy to manually split the path and query params, and partially worried I might mishandle some bizarre edge case that I trusted `urlparse()` with more. So I break it down like this:
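(A sketch assuming Python 3's `urllib.parse`; the dummy scheme and host are arbitrary values.)

```python
from urllib.parse import urlparse

refresh_target = "/account?param1=xyz"  # relative path from the refresh value

# prefix a dummy scheme and hostname so urlparse() splits the pieces cleanly
parts = urlparse("http://dummy.invalid" + refresh_target)

print(parts.path)   # /account
print(parts.query)  # param1=xyz
```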
This is probably not worth much, but for some reason I felt compelled to "finish" the job: once I start cracking a field... I can get carried away and crack it as far as it can go. At least I didn't break the query parameters into a `dict` :>