ReST table breaks if package description contains unusual characters

pohlt commented 4 years ago

Some packages have unusual characters in their description (e.g. poyo). These characters causes ReST tables to break their column alignment.

Filtering the input strings for regular alpha-numerical characters would fix this, I guess.

pohlt commented 4 years ago

I just saw that you actually let ptable do the table output. How about switching to ptable2 which is still actively developed and should have a compatible API. They might already have fixed this issue.

raimon49 commented 4 years ago

@pohlt Thanks for the report. Probably you are talking about " 🐓 " emoji.

In my Linux environment I got the following output:

$ pip-licenses -d --format=rest
+------+---------+---------+-----------------------------------------+
| Name | Version | License | Description                             |
+------+---------+---------+-----------------------------------------+
| poyo | 0.5.0   | MIT     | A lightweight YAML Parser for Python. ? |
+------+---------+---------+-----------------------------------------+

How is it broken in your environment?

raimon49 commented 4 years ago

How about switching to ptable2 which is still actively developed and should have a compatible API. They might already have fixed this issue.

If switching solves this issue, it's an attractive proposition.

However, there are many forked implementations of ptable on PyPI. Making a better choice among them is a difficult decision for me 😔

pohlt commented 4 years ago

Under Windows 10 (Python 3.7) I get

+------+---------+---------+-----------------------------------------+
| Name | Version | License | Description                             |
+------+---------+---------+-----------------------------------------+
| poyo | 0.5.0   | MIT     | A lightweight YAML Parser for Python. 🐓 |   <= under Win10, I see a chicken here and the table edge is shifted
+------+---------+---------+-----------------------------------------+

which is one character too much.

I agree: It's difficult to decide which prettytable fork is currently best maintained. PTable hasn't seen a commit for two years now and I'm worried it's basically dead.

I'll check if ptable really causes the problem and if ptable2 would fix it.

pohlt commented 4 years ago

There is an open issue https://github.com/kxxoling/PTable/issues/20 which describes the same problem. No reaction for 2.5 years.

raimon49 commented 4 years ago

@pohlt Thanks for the working example on Windows 10.

It was reproduced in my macOS environment. Also, I tried installing source code for ptable2 from the latest master branch. This issue has not been resolved.

If this issue can be solved by using ptable2 in the future, it is better to implement a compatibility layer like P-R in the past.

See also) #52

And when I'm sure ptable2 is stable and well maintained, I plan on migrating from PTable.

pohlt commented 4 years ago

After some testing and internal discussions, I would now propose to filter the data which is retrieved from pypi (enabled with a flag).

Rationale:

Fixing upstream packages turns out to be tricky without active maintainers.
Handling Unicode (even in Python) is not as straightforward as one would think and seems to depend on the OS.
Filtering would not be a huge burden (I'm quite sure I'm oversimplifying things):
```
"Hello 🐓 World".encode("latin1", errors="ignore").decode("utf-8")
```
latin1 could be a configuration setting.

Would you be willing to consider a PR?

raimon49 commented 4 years ago

Patch is welcome!! 😄

pohlt commented 4 years ago

Ok, now we just need someone who generates the actual PR! ;-)

I'd like to get your input/guidance before starting. What I plan to do:

Add boolean filter-strings command line option for turning on filtering (default: off)
Add some code to do the actual filtering (for all strings retrieved from pip)
Add string filter-codec command line option for selecting a code page from this list (default: latin1)
Add docs and tests

Does that make sense to you? And don't get your hopes up: I cannot commit to if and when this is happening. :-)

raimon49 commented 4 years ago

@pohlt All right, I'll express my opinion.

Add boolean filter-strings command line option for turning on filtering (default: off)

I think it's fine to adopt it as an opt-out or default behavior.

This is because we already do the equivalent when we load the package license file as a stream. https://github.com/raimon49/pip-licenses/blob/v-2.2.1/piplicenses.py#L148-L149

Note: However, we have adopted errors="backslashreplace" instead of errors="ignore" and replaced it with code points.

Add some code to do the actual filtering (for all strings retrieved from pip)

Yes, I agree.

Add string filter-codec command line option for selecting a code page from this list (default: latin1)

Add docs and tests

I agree with default: latin1.

Testing and documenting that users can specify the filter-codec command line option is going to be a challenge. If you can, make it a goal.

raimon49 commented 4 years ago

This issue has been resolved in pip-licenses 2.3.0 😄

raimon49 / pip-licenses

ReST table breaks if package description contains unusual characters #73