Closed afriedman412 closed 4 months ago
Thanks, @afriedman412. Could you provide a bit more context? Specifically, how does this differ from the horizontal_ltr= and vertical_ttb= parameters of .extract_words(...) and related methods? Are there specific scenarios that those methods can't handle, but the proposed additions can? (Etc.)
OK so the idea here was to make it easier/more intuitive to manually control the direction in which text is read on both axes. If you have text that is rotated by some multiple of 90 degrees, you can just say "the words go right to left and the lines go top to bottom" and it will parse it correctly.
As I understand it, the horizontal_ltr and vertical_ttb params are specific to the direction of words but still presume the direction of the lines. More to the point, they both fundamentally control the same thing, so it's a little confusing to have them separate.
The bigger picture for me is that text direction on both axes is controlled the same way, by choosing which char parameters we use to group lines and words, and those can be cleanly parsed from asking two basic questions about the text orientation when calling the function.
Does that make more sense?
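To make the "two basic questions" idea concrete, here is a minimal sketch, not the code from this PR. It assumes each char is a dict with text, x0 and top keys (as pdfplumber char objects have), and the function name read_chars, the AXIS table and the tolerance parameter are hypothetical names chosen for illustration. The line direction picks the coordinate used to group chars into lines, and the char direction picks the coordinate used to order chars within a line.

```python
# Hypothetical sketch: map each direction to (char attribute, reverse order?).
AXIS = {
    "ltr": ("x0", False),
    "rtl": ("x0", True),
    "ttb": ("top", False),
    "btt": ("top", True),
}


def read_chars(chars, char_dir="ltr", line_dir="ttb", tolerance=3):
    """Join chars into text, reading in the requested directions."""
    line_key, line_rev = AXIS[line_dir]
    char_key, char_rev = AXIS[char_dir]
    if line_key == char_key:
        raise ValueError("char_dir and line_dir must lie on different axes")

    # Group chars into lines: chars whose line-axis position is within
    # `tolerance` of the previous char (in sorted order) share a line.
    lines = []
    for ch in sorted(chars, key=lambda c: c[line_key]):
        if lines and abs(ch[line_key] - lines[-1][-1][line_key]) <= tolerance:
            lines[-1].append(ch)
        else:
            lines.append([ch])
    if line_rev:
        lines.reverse()

    # Order the chars within each line along the other axis.
    return "\n".join(
        "".join(
            c["text"]
            for c in sorted(line, key=lambda c: c[char_key], reverse=char_rev)
        )
        for line in lines
    )


# A 2x2 grid of chars: "a b" on the upper row, "c d" on the lower row.
grid = [
    {"text": "a", "x0": 0, "top": 0},
    {"text": "b", "x0": 10, "top": 0},
    {"text": "c", "x0": 0, "top": 20},
    {"text": "d", "x0": 10, "top": 20},
]
print(read_chars(grid))                                # rows, left to right
print(read_chars(grid, char_dir="ttb", line_dir="ltr"))  # columns, top to bottom
```

Sorting on x0 for both horizontal directions is a simplification; a fuller version might prefer x1 when reading right to left.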
Ah, got it! I do like the idea of being able to specify char/line direction with more granularity. In fact, I think this would be nice as a core part of the main extraction methods. Doing that will require a bit of code-surgery, so I'm going to take that on myself, but will credit you clearly.
Largely as a note to self, it sounds like there are a few different types of scenarios in which the reading direction of text on a page is not left-to-right, top-to-bottom:
pdfplumber handles this well, but may sometimes require custom logic.
Ah, got it! I do like the idea of being able to specify char/line direction with more granularity. In fact, I think this would be nice as a core part of the main extraction methods. Doing that will require a bit of code-surgery, so I'm going to take that on myself, but will credit you clearly.
I can do it when I have time (if you haven't already). But I made the standalone function as a way to soft launch the syntax, with an eye towards full implementation whenever.
- Pages that have been rotated. This feels like a nice-to-have but has been a lower priority; it seems like your suggestions here would sort of "automatically" handle that, which is nice.
Yeah I mean this issue is technically about issues parsing rotated text. And granted, a lot of this could be easily sorted upstream or downstream of text extraction, but it makes sense to put all text direction control in one place given how easy that is.
Good news: I've made some progress on incorporating this more deeply into pdfplumber. There are still a few implementation details for me to iron out, but the general approach seems promising.
One wrinkle I realized: There are basically two variations of RTL text: (a) text that runs right-to-left for page rotation reasons (such as those in your examples) and (b) text in scripts/languages that naturally run right-to-left. In (a) the assumption, also reflected in the tests in this PR, seems to be that the user would want that text "fixed" in the output — i.e., for the output to read LTR. But for (b) I think it's safe to say that users would want the text to remain RTL.
I don't think there's an automated way to tell the difference between those two scenarios with high fidelity, so I'm planning to add two additional parameters, char_dir_output and line_dir_output, which would default to the values of char_dir and line_dir but allow the user to override that.
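The defaulting behavior described here could be sketched roughly as follows. This is an illustration only, not pdfplumber's implementation: resolve_directions is a hypothetical helper, and the parameter names are the ones proposed in this comment, which were still under discussion at this point in the thread.

```python
VALID_DIRS = {"ltr", "rtl", "ttb", "btt"}


def resolve_directions(
    char_dir="ltr",
    line_dir="ttb",
    char_dir_output=None,
    line_dir_output=None,
):
    """Fill in the output directions, falling back to the reading directions."""
    resolved = {
        "char_dir": char_dir,
        "line_dir": line_dir,
        # None means "not specified": reuse the reading direction.
        "char_dir_output": char_dir if char_dir_output is None else char_dir_output,
        "line_dir_output": line_dir if line_dir_output is None else line_dir_output,
    }
    for name, value in resolved.items():
        if value not in VALID_DIRS:
            raise ValueError(f"{name} must be one of {sorted(VALID_DIRS)}")
    return resolved


# Case (b): naturally RTL script, so the output direction follows the input.
print(resolve_directions(char_dir="rtl"))
# Case (a): RTL only because of rotation, so the user overrides the output.
print(resolve_directions(char_dir="rtl", char_dir_output="ltr"))
```

Using None as the "not specified" sentinel keeps the override explicit: the caller only passes an output direction when it should differ from the reading direction.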
One other note: dir as an abbreviation has some potential ambiguity, as it's often used to refer to a "directory". I'm OK with using it here as short for "direction", but figured it might be worth brainstorming other potential naming conventions. Open to suggestions.
Thanks for the input! Fully agree about using dir as a keyword.
I haven't gone back and looked at the code, but I think the idea was to infer the "reading" direction from the line direction. Anyways, new params are fine with me, although I would suggest something like natural_char_direction instead of output. Feels a little more intuitive.
Thanks for the response! Re. natural_char_direction: Interesting, I find that less intuitive. In my mind, at least, it raises the question: "natural" to what? Keeping on riffing: What about render_char_direction / char_direction_render?
yo in the interest of expediency im fine with whatever you think is best!
Now added in https://github.com/jsvine/pdfplumber/commit/850fd45 and available in v0.11.0. Many thanks again!
Fixes #848 (partially)
Adds extract_text_dir_sensitive function to .utils.text which lets the user specify which direction the lines and characters should be read.
Because the syntax is new, I didn't want to just alter extract_text_simple but I can integrate the two if that would be preferable!