Improve code quality and secure dependencies

neroden / timetable_kit

A Python toolkit for generating human-readable timetables from GTFS data; uses PANDAS and gtfs_kit

GNU Affero General Public License v3.0

39 stars 2 forks source link

Improve code quality and secure dependencies #11

Closed Borketh closed 1 year ago

Borketh commented 1 year ago

Hello, Nathanael and Christopher!

I was poking around the repository because I'm interested in making VIA-style timetables using this tool (as opposed to VIA timetables in the Amtrak style) and I noticed a few things I could contribute. If you feel that any of these changes are of no use or contradict your personal code styles, that's fine. I can pick through this and remove any offending changes before you accept the rest.

Most commits have comments in them, but as a general summary, this PR includes:

Safer dependency specification

Instead of having every dependency specified with >=, I changed it to ^ so that major revisions do not get pulled by poetry without the code being ready for them. For example, ^1.4.1 is good for 1.X.Y but excludes 2.X.Y until the dependency spec is modified.

Code quality and performance

`str.join`

Wherever str.join was used without a variable number of arguments, I have replaced it with one of the following:

A single string literal when that's all it is.
An f-string when variables are involved (while retaining multiline html tag argument format for readability).
Concatenation when there are inline comments that should be preserved and few literals.
Arguments to print when that's the next step, because it stitches them together anyway.

While the runtime and memory usage of timetable_kit isn't critical, every time str.join is used here, a new temporary list is allocated for all those strings before assembling the final result. This isn't necessary, especially for places where all that is being joined is string literals.

Additionally, where str.join is actually used, I have endeavoured to use generator expressions instead of allocating multiple intermediate arrays. Since str.join and similar functions only need Iterable, these work just fine.

Set declaration and usage

Where a collection was declared that was only used in set operations later, or where the set function was called on an array literal, I have simplified that to just a set declaration (which uses the same brackets as dict but without colons)

Where sets were combined by unpacking one or more sets into an array before converting that into the output set, I have simplified that into a set union operation (|). This avoids creating a relatively large and useless array as well as rechecking the elements of each set against one another while merging.

Imports and feed_enhanced shenanigans

Some files imported things they didn't need anymore, those have been cleaned up. Some files didn't import what they need, so I have imported the right things (at least I think).\

I have also refactored all uses of the feed_enhanced "monkey-patch" to use a subclass of gtfs_kit.Feed called FeedEnhanced that implements all of the extra filter functions that were previously added to gtfs_kit.Feed through the cursed magic of Python. There is also a .pyi stub file that tells type checkers that the members carried over from gtfs_kit.Feed exist. The weird way that they take a shortcut by looping through locals and using setattr doesn't play nice whatsoever.

Other

Reduction of the number of long if blocks when boolean operations or single-line conditional assignment can be used instead.
I ran black on the whole repo, albeit with one extra argument that prevents weird stuff such as in multiline f-string delcaration of <img> tags. Without this, it wants to make any line of string literal that doesn't contain double quotes into a double-quoted string.
A minor addition to the .gitignore so that my JetBrains IDE doesn't add a bunch of useless things to the repo.

neroden commented 1 year ago

OK. I can take some of this and not other parts.

So first of all, here are some of the unacceptable changes: -- Don't change the dependencies >=. The project is enough of a work in progress that I don't want to accidentally get myself stuck to an old major version; if a new major version comes out, I'm not going to notice unless the build breaks. -- I don't like f-strings. I have trouble reading them and they are bug-prone. Looking at what you've done, they are more readable for the debugging output which is going straight to "print", for which tiny format bugs are not important, so these are OK.
-- But there must be absolutely no f-strings in the code which is generating the HTML and CSS; they are too mistake-prone in the extremely fiddly HTML-generating code, or for that matter for generating the pdftk command. Please remove all f-string changes which are not sent directly to "print" for debugging output. -- I don't like implicit string concatenation either. Don't do it. Your changes in baggage_img_str, for instance, are less readable than what I had and more error-prone. -- Don't replace "join" with string concatenation when there are more than two strings to concatenate. Everything I've read indicates that "join" is significantly faster than a + b + c. -- Don't replace "join" with print with a 'sep=""' argument. I find it irritatingly easy to miss the sep argument to print, and the "join" is easier for me to read. -- Do not change the names of the arguments passed to functions to be prefixed with underscores. It is not yet clear which arguments are going to be part of the public API -- technically everything is internal implementation details right now -- but you've definitely made some wrong choices here, several of the things you prefixed with underscores are probably going to be part of the final API. -- Don't do double-nested list comprehensions (with two "fors" in the list comprehension). I find these unreadable which is why there's an old-fashioned for loop there. -- I must review any replacement of traditional control flow by a list comprehension individually to decide which I consider more readable. I usually find traditional control flow more readable except in special situations. -- it's "reroute", not "rerouting", in US rail terminology -- Unless there's an extremely strong Python-standard-usage reason to change it, amtrak/station_url.py should refer explicitly to amtrak.json_stations; there are likely to be other json_stations in future. -- When I generate an named intermediate variable and immediately use it, as with logo_css_list, this is intentional. Do not attempt to "simplify" it by packing it into one statement.

Now to the good: -- I very much appreciate subclassing the gtfs_kit Feed; I really hadn't got the hang of subclassing. I will have to verify that literally everything works, but this is definitely the direction to go. -- Thank you for straightening out the imports (requirements have been changing fast) -- Thank you for the comment typo fixes. -- The direct set manipulation is good. -- Removing the "+" or "join" from print statements where the strings are concatenated with the default single space is good.

Now to the mixed: -- I appreciate breaking out train_number_range as a function, but you didn't document it. ALL functions get a docstring. -- breaking out parse_station_name to avoid duplication is good, but it can't go into generic_agency. It's actually an Amtrak-specific parsing routine (despite any appearances to the contrary) and belongs in amtrak/station_names.py. And I don't like your implementation of parse_station_name with regular expressions; it's unreadable, and may be incorrect (depending on whether Amtrak starts using weird characters in the station names, which they might). Keep my implementation. Also, you failed to give it a docstring. Every function gets a docstring.

I do appreciate your interest in timetable_kit! Sorry this is mostly negative -- most of your smaller changes are directly contrary to my preferred coding style and I find them unreadable and prone to generating errors.

I especially appreciate breaking out the EnhancedFeed subclass of Feed (although it's going to need extensive testing since literally every tool depends on the monkey-patching) and the cleanup of the set manipulation. If you could break these out as individual patch submissions I would appreciate it.

To explain my coding style: I prioritize readability for me over any considerations of memory usage or speed. I will optimize for speed when I have to, but if it isn't on the critical path in the inner loop with a measurable, verified, and significant runtime difference, I will always do the slower version which I can read better. I happen to find join([]) particularly easy to read -- and f-strings particularly unreadable when I'm checking every single piece of whitespace as I have to with the HTML and CSS generation.

neroden commented 1 year ago

Oh, I do have one other question: what do you mean by "VIA style"? Do you have an example of the desired look?

One of my goals was to make the output look more customizable, so I'd be interested in doing this. There's a lot of stuff I need to factor out and abstract in the parts of the HTML which go "around" the actual timetable proper; too much of it was hardcoded to get the working prototype working (while including all the appropriate disclaimers). There's also one exceptionally finicky issue with styling column headers where PANDAS doesn't provide what I need; I haven't felt up to patching PANDAS. Most of it is extremely CSS-adjustable though.

juckins commented 1 year ago

I am a novice python programmer who has more experience running things than coding things, so I'll leave the code format comments to neroden. I agree that, for me, readability is preferred over speed - unless the time savings are truly significant (like tens of seconds). Regardless, I am learning a lot from the discussion here and appreciate the dialog.

Borketh commented 1 year ago

Hi again, and sorry for the delay.

I'm glad that this is somewhat useful, and I concede most of your issues with some of this. I have a couple things I'd like to expand on and a few questions.

Code Bits

I take issue with the >= thing because it may be a pitfall for an end user who has no knowledge of the project and can't get it to work out of the box. In general, unless there's actual need for new features or the major revision of a dependency has security flares, you don't need to use any new version just yet. Besides, there are better ways to stay on new versions, such as GitHub Dependabot. Heck, even my code editor was telling me there were better minor versions that could be specified.

With respect to f-string vs join, the idiom that join is the fastest way to concatenated is outdated. There have since been many improvements to them that make them faster. However, I understand that they may be less readable in certain situations.

Tangentially related: if html construction is hard to get right, why is it done manually? There are hundreds of libraries that will do that for you with less hassle and more readability. The same likely goes for the CSS, although I haven't looked at that yet.

I apologize for the lack of docs on the helper functions. I always forget to document functions, especially smaller ones.

Next Steps

The refactor of the Amtrak station name parsing could be put in the Amtrak folder, but I have a bigger question. What's the end goal, structurally? There are the makings of a generic framework and specialized implementations, except that there's a whole bunch of duplicate code across current agencies that ignores that paradigm

This PR is not going through in its current state, so how do you want me to add what you do want? Amending a bunch of stuff directly in here would make commit history a mess, so I could either make a new one with just the good stuff, or make one for each kind of change.

Borketh commented 1 year ago

Oh, and this is what I meant by the VIA style timetable PXL_20230801_054823690~2 They don't make these anymore (it's dated Nov 5 2017), so it's probably some sort of historical artefact now...

On that note, I'm not the one who wants these. I've been asked to look in to doing this for an organization (that I should actually check whether I should name at the moment), but regardless I wanted to ask about the hardcoded "This was made for the RPA" text. If other organisations are to be able to use this tool for their own purposes, they should be able to say they made it (of course, still giving credit to the tool). Additionally to be able to match this style I would need to be able to have multiple footnote areas for both English and French (because Canada).

neroden commented 1 year ago

If you could prep separate patches for each type of change ("feature patches") I would appreciate it. It's much easier to review. I admit that I haven't been committing stuff to this repo that way myself, but that's because it's still a pre-alpha and it's my repo so I get to be sloppy.

One thing that I would like some help with -- installation practices or structure. I'd ideally like this to become a package which can be installed with the individual tools being scripts which can be run, and... I don't really think I've got the package structured for installation and use. (I have 40 years of coding experience in a dozen languages, but no Python packaging experience. I have been slowly shifting this out of "overgrown script" status but it's not really packaged.)

If I can get it to the point where it's actually reasonable to install it as a package and run it that way, rather than the current muddle of running it from a working-directory virtualenv with a poor distinction between code, data, and output, I'll change the version number dependencies to ^.

The HTML construction is done manually because I haven't found a library which is both lightweight enough and precise enough. Most of the work is text construction, which has to be done by hand anyway. The persnickety bit, it turns out, is getting the whitespace (or lack of it) exactly right, which means that a coding style where the whitespace is more visible than the actual text is useful -- that's what led to the "join" style, really. I also often /diff/ the resulting HTML to check for changes due to the GTFS so I need to maintain consistency in the nonprinting whitespace too. Most coding styles render the whitespace less visible than the non-whitespace, and I need the whitespace to jump out at the reader of the code. I realize that this is non-obvious.

The situation with the duplicated code is weird -- it's largely due to having several "agency" modules which are basically derivatives of Amtrak, "Amtrak plus" if you will. They should directly reference and use the Amtrak code, with the amtrak module loading first. This applies to "maple-leaf" (which is a hack to deal with the fact that VIA has wrong data for the US part of the Maple Leaf while Amtrak has wrong data for the Canadian half), and "hartford_line", whose GTFS basically describes a supplement to the Amtrak service. When I started this I wasn't expecting to do so many "Amtrak-derivative" agencies but it does seem to be a thing.

Common code between the via/ and amtrak/ modules, however, should be broken out as generic. There's some restructuring needed around wheelchair access info which is a large job I haven't started (the Amtrak code should be patching the GTFS, and then the GTFS values should be used everywhere).

The RPA text is controlled with a flag in the .json files, it's easy to turn off.

The multiple footnote text will take more work. The HTML which is wrapping the main timetable is Jinja2 templates in templates/ and is Jinja2 templates, but I'm not satisfied with the relative lack of modularity in the design here.

Thanks for the VIA classic timetable link! Regarding some of the features of this: -- The HTML surrounding the main table, footnotes etc. would have to be substantially redone. This is all pieced together from Jinja templates in the templates/ folder, but I don't like how modular it is (or isn't) right now. Making this more modular so that a different-look "main HTML/CSS" template could be specified at the command line when generating the timetable, or in the .json file, is a goal. -- I think we could modularize out the day specifications, controlled by an option in the .json file, to use VIA's 1234567 scheme instead of the much more readable MoWeFr scheme, though I'm not sure it's a good idea. I have made artistic changes to make my "Amtrak-style" timetables more readable than the original Amtrak timetables, and this is a case where I don't think much of VIA's approach. -- Using 24-hour times is mostly already programmed in, though I haven't tested it in a while and should check that it's triggered from a switch in the .json file -- Putting the AR/DP in its own column (as an option) is something I've been meaning to program and had half-programmed (it is closer to the original Amtrak style too) but haven't done -- Alternating light and dark on rows is somewhat finicky but could probably be done by creating another mix-in CSS class or two and having a routine run through the table to assign them. It would be a good feature. -- The font is designed to be changed, though not currently without changing the Python code. I need to come up with a good interface for specifying that. That said, the font I included is probably as good as it gets. -- The row showing whether business class is available is something I need to write for Amtrak too ("services" row), just haven't actually done it. -- Most of the rest of the decor in the table is just plain CSS choices and is in the Jinja2 templates. Again, this isn't modular enough right now, but if I can get the framework more modular, it should be easy to change the actual decor.

Borketh commented 1 year ago

I'll get right to proper individual PRs soon, and I can whip up a prototype "properly packaged" structure for you to have a look at if you want. Can you make a GitHub issue for each of those bullet points you want? That way we can address each of them with PRs as necessary.

neroden commented 1 year ago

Unfortunately Real Life (urgent family medical stuff) is about to prevent me from being able to do any work on timetable_kit to speak of for another several weeks. I just sort of wrote that last thing off the top of my head. I'd love to get back to it and structure it as GitHub Issues but, I'll be honest, it ain't happening in the next several weeks.

This medical stuff has been taking up massive, inordinate amounts of my time. The problem is that doctors are infecting their patients with Covid, so finding a medical office which won't do that is very difficult and requires long travel, and of course the hotels aren't generally safe either.

I wrote a peer-reviewed, citationed handout for the doctors (it's here: https://whn.global/doctors-should-not-infect-patients/ ).

But until the world starts wearing respirator masks and installing HEPA filters, my time for working on programming projects at all is going to be very interrupted.

Borketh commented 1 year ago

That's alright, I'll do most of the work up to where it needs your approval. I wish you the best in your medical shenanigans!

neroden commented 1 year ago

In case you're wondering, now that I realized that I have multiple "amtrak derivative" agency modules, and at least one "via derivative" agency module, and code which generically works for some-but-not-all agencies, and some of that code depends on reading data in the feed (sigh), I am going to restructure the agency modules to put the code into classes, so I can use inheritance. This is a bit of a pain and will take a while. In order to avoid breaking things, I'm going to move over a few functions at a time. I'm working on the "singleton" branch, since for the first implementation each agency gets a singleton.

neroden commented 1 year ago

If you could break out the improvements to set handling in special_data.py next, I'd appreciate it.

neroden commented 1 year ago

OK, I'm mostly done with the primary object-oriented refactor. I had some very particular goals in how I did this which may not be obvious. There's certainly room for additional refactoring, but I got rid of most of the code duplication.

I'm still unsure how to best do the GTFS acquisition. I want to keep the parts where the GTFS is actually downloaded separate from the rest of the program so that the program can be run offline, with old GTFS data, with hacked GTFS data, with sample GTFS data, etc. The only thing it really has to share with the rest of the program is the data unpacking location, though the download URL may be appropriate too. This gets into questions of how to do the data/code separation -- if this is to be installable as a package, this separation needs to be done properly with a sensible design for the GTFS data directory locations, something I haven't even started in on.

I was poking around at some of what you've been doing, and one note: The connecting services data tables are lists, not sets, because order matters.

Borketh commented 1 year ago

OK, I'm mostly done with the primary object-oriented refactor. I had some very particular goals in how I did this which may not be obvious. There's certainly room for additional refactoring, but I got rid of most of the code duplication.

Oh... I have one already queued up to go.

Borketh commented 1 year ago

Given the volume of divergent code I don't think it would be feasible to use mine anymore, though. You can still definitely check it out for inspiration I guess. It's on my repo in the crazy-oo-refactor branch

Borketh commented 1 year ago

@neroden is there an easier way for us to discuss this project? Sporadic comments on a long-dead PR aren't really ideal. I see you've pulled the rest of my changes manually.

neroden commented 1 year ago

@neroden is there an easier way for us to discuss this project? Sporadic comments on a long-dead PR aren't really ideal. I see you've pulled the rest of my changes manually.

Do you have an email address? Mostly I've been discussing it with Chris Juckins by email, but I don't have an email for you. My email's in the pyproject.toml file.