Poll: Is this library ready for 1.0?

k00ni commented 4 years ago

There was a discussion in #318 about whether to merge a change that might require adaptations by users of this library. PDFParser follows Semantic Versioning, which has some implications for how to handle that situation. One option is to bump the major version from 0.x to 1.0. I will outline the arguments I have heard so far and wanna invite all of you to vote. Feel free to comment, I will add new arguments to the lists.

Thank you for helping here

:+1: - Yes, jump to 1.0
:-1: - No, keep 0.x for now

Arguments against 1.0

1. A check to some extent is required to determine whether all parts of the library (API) are as we want them to be. The focus has to be on the current feature set and the behavior behind it. Is there something fishy that has to be taken care of before 1.0?
   - Based on GitHub stats there are almost 800 projects directly or indirectly depending on PDFParser. Some may also use it in production environments that generate money. PDFParser is an Open Source library with no obligations, but in my opinion we can't just change something and leave developers out in the cold.
   - If we just bump the version, we acknowledge that the current state is fine. But if we encounter major problems in the near future, we might be forced to bump again.
2. Currently we receive many fixes and some features, for instance from @PaulBehrendtVentoro, @Connum and @izabala.
   - That is great! But because of that it is important to know whether it's planned to have API changes be part of future contributions.
   - As long as we have 0.x, changes in API (or behavior) won't be merged, but in case a (basic) check like in (1.) was conducted, we can collect API changes and put them together into a new 1.x release.

Arguments for 1.0

Semantic Versioning allows the bump from 0.x to 1.0 at any point. Developers "know" that this might happen and that their code might be affected.
People who use the Composer constraint "^0.x", or otherwise stay at 0.x, are not affected and shouldn't experience any problems.
Make changes in API or behavior optional and allow developers to enable them via parameter.
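To illustrate why a "^0.x" constraint shields users from a 1.0 release, here is a minimal Python sketch of how Composer's caret operator bounds a version range. It is not Composer's implementation: the function names are made up for this example, and it assumes plain numeric constraints such as "^0.19" or "^1.2.3" with at least one non-zero part (no pre-release tags, no bare "^0").

```python
def parse(version):
    """Turn '0.19.2' into (0, 19, 2); missing parts default to 0."""
    parts = [int(p) for p in version.split(".")]
    return tuple(parts + [0] * (3 - len(parts)))


def caret_upper_bound(base):
    """Exclusive upper bound of a caret range for a parsed base version.

    Assumption for this sketch: the leftmost non-zero part is the one
    that must not change, which matches Composer's documented examples
    (^1.2.3 is <2.0.0, ^0.3 is <0.4.0, ^0.0.3 is <0.0.4).
    """
    major, minor, patch = base
    if major > 0:
        return (major + 1, 0, 0)
    if minor > 0:
        return (0, minor + 1, 0)
    return (0, 0, patch + 1)


def satisfies_caret(version, constraint):
    """True if `version` falls inside the caret range of `constraint`."""
    base = parse(constraint.lstrip("^"))
    return base <= parse(version) < caret_upper_bound(base)


# A project pinned to "^0.19" keeps receiving 0.19.x, but never 0.20 or 1.0.
for candidate in ("0.19.5", "0.20.0", "1.0.0"):
    print(candidate, satisfies_caret(candidate, "^0.19"))
```

Under these assumptions, a 1.0.0 release is simply invisible to anyone who required "^0.19"; they only pick it up after changing their constraint.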

Implications here

[ ] prepare a new release
[ ] add an UPGRADE file to inform developers of required code changes

BTW: The vote may guide our decision whether to keep #314 or not.

k00ni commented 4 years ago

I am torn after writing the initial post and wanna see how you vote.

Connum commented 4 years ago

This decision is not a light one... But right now, I'm in favour of staying at 0.x.

Here's my current wish list of things I would expect a 1.0 version to support, not necessarily depending on whether those fixes require API changes or not:

[ ] Ability to parse the text in an order that makes sense to human beings, instead of the order in which the text (parts) appear in the raw data (i.e. sort text blocks via coordinates)
[ ] Parse UTF-16 strings correctly (there are currently problems with Japanese and Chinese characters being partially scrambled). I'm in the process of looking into this, but it seems to be more complicated than I initially thought, and encoding stuff is not my strong suit. It looks like changes might be necessary that have the potential to break current behaviour.
[ ] No memory overflow with larger files or when encountering encoding issues (there are several open issues in that direction) => possible security issue?
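As a rough illustration of the first wish ("sort text blocks via coordinates"), here is a minimal Python sketch. It is not PDFParser code: `TextBlock`, `reading_order` and `line_tolerance` are hypothetical names, and it assumes a single-column, left-to-right, unrotated page in the usual PDF coordinate system, where the origin is at the bottom-left and y grows upwards.

```python
from dataclasses import dataclass


@dataclass
class TextBlock:
    """Hypothetical text fragment with the position of its start point."""
    x: float
    y: float
    text: str


def reading_order(blocks, line_tolerance=2.0):
    """Return blocks top-to-bottom, then left-to-right within a line.

    Blocks whose y values differ by at most `line_tolerance` from the
    first block of the current line are treated as the same line.
    """
    # Larger y means higher on the page, so sort by y descending.
    top_down = sorted(blocks, key=lambda b: -b.y)

    lines = []
    for block in top_down:
        if lines and abs(lines[-1][0].y - block.y) <= line_tolerance:
            lines[-1].append(block)
        else:
            lines.append([block])

    # Within each line, order the fragments from left to right.
    return [b for line in lines for b in sorted(line, key=lambda b: b.x)]


# Fragments in raw-data order, which is not the order a human would read.
raw = [
    TextBlock(200, 700, "world"),
    TextBlock(50, 500, "second line"),
    TextBlock(50, 701, "Hello"),
]
print([b.text for b in reading_order(raw)])
```

Multi-column layouts, rotated text and right-to-left scripts would need more than this simple sort, which is part of why the item is harder than it looks.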

And things I'd like to see in a version 1.0, but could be convinced to have added later:

[ ] Right-to-left language support
[ ] Performance improvements for larger PDF files

Reqrefusion commented 4 years ago

First of all, since I have not contributed to this library, I did not consider myself qualified to take part in this vote. However, in #318 I felt compelled to express my opinion at @k00ni's kind invitation.

Before I give my opinion on the topic discussed in the title, I would like to say a few words about the voting logic. Arguing, not voting, leads people to the right conclusion. I experienced the truth of this especially on Wikipedia. The arguments in the debate influenced the decision, not the number of votes. A vote that came with a very good argument could affect the result even if all other votes were against it. I also humbly request that this policy be taken into account in this voting.

Our first question is: what does 1.0 mean? I think it means the first stable release that accomplishes the goals of the program. So what is the purpose of our PdfParser? Actually, this part is a bit complicated, but I will look at the definition sentence to keep it simple. The definition is as follows: "PdfParser, a standalone PHP library, provides various tools to extract data from a PDF file." The first thing we understand from this sentence is that the library is standalone. We meet this requirement for now. It also extracts data from PDF files. We do this, but it cannot be said that we do it reliably.

I talked about this first because I think 1.0 means something special. Switching to 1.0 and doing a major update are different things for me. For example, if there was an update for this library that detects paragraphs and the library was at 1.x, I would have said to switch to 2.0, but if the same change was about going from 0.x to 1.0, I would definitely hesitate. However, such a change is not even mentioned here. I don't really see a major change, and certainly none that would justify a transition to 1.0.

Because @Connum knows this library better than I do, he made clearer recommendations. I agree with his suggestions and declare that the library is not ready for 1.0.

k00ni commented 4 years ago

@Reqrefusion thank you for participating.

> The arguments in the debate influenced the decision, not the number of votes. A vote that came with a very good argument could affect the result even if all other votes were against it. I also humbly request that this policy be taken into account in this voting.

You are right, I didn't think about that. What can I do to take that into account?

Reqrefusion commented 4 years ago

Actually, I think this subject is a bit complicated, because there is no single truth that can be applied in every situation. I personally have not seen this idea in discussions of open source projects. I have seen it in Wikipedia's policy discussions. When choosing a featured picture, even if all votes were favorable, a single comment backed by solid technical knowledge could halt the selection. Although Wikipedia's policies are vulnerable to abuse, I think this approach works very well in open source development. Unfortunately, my English is not good enough to explain this topic very well, so I am including the relevant policy page on Wikipedia. https://en.wikipedia.org/wiki/Wikipedia:Polling_is_not_a_substitute_for_discussion

hpvd commented 3 years ago

sorry, too late... but

hmm v1.0 with open issues for

multipage text extract not working see https://github.com/smalot/pdfparser/issues/331
random white spaces https://github.com/smalot/pdfparser/issues/396
memory leaks since 2016 https://github.com/smalot/pdfparser/issues/104

?

imho these are all major problems which should block a v1.0

k00ni commented 3 years ago

Let me address your concerns and explain my/our decisions:

(1) We want to support (and test for) PHP 8.0 but also keep development as lightweight as possible. Dropping support for a PHP version requires us to increase the major version by 1 (we use Semantic Versioning). PHP 5.6 and 7.0 are end of life and not supported anymore. IMHO I would have kept 0.x for a while longer, but we won't risk fatal errors in case someone still uses PHP 5.6, for instance.

(2) PDFParser is in active maintenance with a few outside contributions now and then. Because we lack active development and bug fixing, we have to improve the things at hand IMHO. I personally focus on improving the testing environment and helping contributors to get their PRs merged. You are right, there are a lot of bugs, but this library is also used by hundreds of projects (https://packagist.org/packages/smalot/pdfparser/stats). So PDFParser is useful to a lot of people and doesn't need to hide. These bugs must be solved, no discussion here. But who is gonna do it? Until they are solved, we have to live with them and can use the things that already work.

(3) In my opinion a 1.0 is not a big deal for a library which is as mature as PDFParser. Announcing it more broadly might have been better, to give room to discuss concerns. Maybe not, I don't know. In March I stated that I was thinking about creating a new major release (https://github.com/smalot/pdfparser/pull/383#issuecomment-795346242). I followed it up 16 days ago with more precise information (https://github.com/smalot/pdfparser/pull/383#issuecomment-818551873). Announcing it in this issue as well might have been better.

I can only speak for myself. Maybe @smalot or @j0k3r wanna add something here too.

j0k3r commented 3 years ago

I totally agree with @k00ni.

Releasing v1.0.0 doesn't mean that we (or contributors) won't work on fixing bugs. We'll release new patch and minor versions in the future to fix bugs and maybe add some features.

hpvd commented 3 years ago

thanks for getting back. Of course a new version can be released at any time, as freely decided by the maintainer or release manager!!

Just two thoughts about it

going from 0.x to 1.0 indicates imho something like going from beta to final. With this one shows the world one is pretty sure that the basic advertised tasks (e.g. text extraction) are working with only minor drawbacks (e.g. problems with some kinds of malformed PDFs). As described above this is not really the case...
in general, before every official new release (even on minor version bumps) one should check several standard tasks. Only if these work like a charm, the release is ready for prime time. Otherwise there is a big chance to break tens, hundreds or even thousands of production systems relying on the tool...

j0k3r commented 3 years ago

> going from 0.x to 1.0 indicates imho something like going from beta to final

I don't agree. Nothing is final. And I can't find many examples of libraries going from 0.x to 1.x with a lot of issues still open. Going from 0.x to 1.0 only means the library is ready for prime time, and I think releasing 1.0.0 while dropping some PHP versions is a good idea.

> in general, before every official new release (even on minor version bumps) one should check several standard tasks. Only if these work like a charm, the release is ready for prime time. Otherwise there is a big chance to break tens, hundreds or even thousands of production systems relying on the tool...

If we follow what you say, no new release will be shipped until all issues are closed. A release is ready for prime time when a few bugs are fixed. Don't expect software or a library to be released only when there are no more bugs. Otherwise, you won't be able to use anything on this planet.

We need to move forward. Also, if you are not happy with 1.0.0 while some issues are still a problem for you, we'll be happy to review your PRs which fix them.

k00ni commented 3 years ago

I wanna add some points to @j0k3r's comment.

Composer uses Semantic Versioning, therefore it is reasonable for us to do so as well. Besides, a lot of libraries use it too, so it's quite common. The reason for the bump was the many problems resulting from using libraries which have to support all major versions from PHP 5.6 to 8.0 (e.g. PHPUnit). For instance, keeping PHP 5.6 makes no sense here, because it is end of life, ties up scarce resources, is slow, and there are better ways to do things with PHP 7/8. But when you remove a once-supported version you have to bump the major version, or it violates Semantic Versioning and risks people installing a version on an incompatible system.

I wanna point out that this is a general problem with Open Source software. We don't have the manpower required to release stable, polished software like it used to be ("It's done when it's done"). We might have too many (serious) bugs for a 1.0, but that's an opinion.

> in general, before every official new release (even on minor version bumps) one should check several standard tasks. Only if these work like a charm, the release is ready for prime time. Otherwise there is a big chance to break tens, hundreds or even thousands of production systems relying on the tool...

Yes, in general. I don't wanna use the phrase "something is wrong for you? why don't you help?" because it's not helpful. I am for quality over quantity, but it always depends on the circumstances. Do the best you can with what you have. We didn't promise anything which we didn't deliver, so there is no harm here. No system will break because of our transition from 0.x to 1.0.

hpvd commented 3 years ago

maybe you got me wrong: of course there can be open issues. There can be hundreds of open issues and also bugs on every release.
I'm only talking about major ones which make it hard or even impossible to use the major features of a tool (named above).

If one does not do these quality checks before every single release, one will break many things. Issues are not only about open features and optimizations, but also about features working fine in releases of the past.

k00ni commented 3 years ago

> If one does not do these quality checks before every single release, one will break many things. Issues are not only about open features and optimizations, but also about features working fine in releases of the past.

@hpvd I welcome you to help us here. What further quality checks would you use before each release? And how would you "implement" them here?

hpvd commented 3 years ago

> @hpvd I welcome you to help us here.

Even if it's not obvious, I already try to help ;-)

The simplest way would be to define 3 PDF files which should always work. They should be of a complexity which is not very low but manageable. Maybe something like the one shown in https://github.com/smalot/pdfparser/issues/416#issuecomment-829056645 can be a good start. Having these files, one sets up an additional demo which does not use the latest release but the daily build. Putting these files in the new demo, one can easily see if the big things are working right.

In a next step of course one can automate this and compare the outputs of the daily build demo which should be the next release with the outputs of the other demo of the last release.
Next step would be to do this comparison on every code change.
Next step would be to add more PDFs, add more complicated PDFs etc...
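The comparison step described above could look roughly like the following Python sketch. It is only an illustration under stated assumptions: the text extracted from each fixed PDF by the last release and by the daily build is assumed to be available already as plain strings keyed by file name, and `find_regressions` is a hypothetical helper, not part of PDFParser or its test suite.

```python
import difflib


def find_regressions(baseline, candidate):
    """Return {pdf_name: unified diff lines} for every changed output.

    `baseline` maps a PDF file name to the text extracted by the last
    release; `candidate` holds the same for the daily build. A file that
    is missing from `candidate` counts as empty output.
    """
    regressions = {}
    for name, expected in baseline.items():
        actual = candidate.get(name, "")
        if actual != expected:
            diff = difflib.unified_diff(
                expected.splitlines(),
                actual.splitlines(),
                fromfile=f"{name} (last release)",
                tofile=f"{name} (daily build)",
                lineterm="",
            )
            regressions[name] = list(diff)
    return regressions


# Made-up outputs for two of the fixed PDFs.
last_release = {"a.pdf": "Hello\nworld", "b.pdf": "Page 1"}
daily_build = {"a.pdf": "Hello\nworld", "b.pdf": "Page1"}

for name, diff in find_regressions(last_release, daily_build).items():
    print("\n".join(diff))
```

Run on every code change, an empty result would mean the big things still behave as they did in the last release; any diff shows exactly which file changed and how.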

Reqrefusion commented 3 years ago

I have quite a strange feeling about switching to v1.0. As I mentioned before, I see v1.0 as a version that fulfills the basic functions of the program without any problems. Because of this, I agree with @hpvd. However, I understand that v0.20 is not the only way to use the newest version, leaving the older versions of PHP behind. If the program were v1.x, such a change would definitely require v2.0. I wrote about this in more detail at https://github.com/smalot/pdfparser/issues/348#issuecomment-707739037

I don't know if @Connum is around, but I wonder what he thinks about it.

Connum commented 3 years ago

> I don't know if @Connum is around, but I wonder what he thinks about it.

Well, the child has already fallen into the well now, as we say in Germany... 😄 1.0.0 it is, we can't move back. Anyway, my two cents: I wouldn't see SemVer so strictly before a version 1.0.0. In my opinion, anything could happen in 0.x versions, including a complete rewrite of an API or dropping a supported version of a dependency or technical system, as in this case. But this is just a gut feeling and how I'd tend to handle this in my projects. I don't know what's the industry standard / best practice when adopting semantic versioning, or if there is anything about this in the specification itself.

So, what I wrote in https://github.com/smalot/pdfparser/issues/348#issuecomment-701261473 still stands: I would have liked to see the big issues with encodings, spaces etc. fixed first, to have everything more stable. But I also understand @k00ni's point of view and his wish to stick to SemVer all the way through. I wish I had more time on my hands to debug and contribute more fixes, and I bet other contributors feel the same. But with the current rate of bugfixes, I don't see that happening anytime soon. And as he also said, with the library being used as broadly already... After all, it's just a number!

k00ni commented 3 years ago

@Connum and @Reqrefusion Thank you too for your input.

> The simplest way would be to define 3 PDF files which should always work. They should be of a complexity which is not very low but manageable. Maybe something like the one shown in #416 (comment) can be a good start. Having these files, one sets up an additional demo which does not use the latest release but the daily build. Putting these files in the new demo, one can easily see if the big things are working right.

@hpvd If I understand you correctly here, it's already done: we have a number of PDFs which are used inside tests. These tests run on each code change. Running them more often makes no sense to me, because you would test the same thing again and again. Am I missing something?

> In a next step of course one can automate this and compare the outputs of the daily build demo which should be the next release with the outputs of the other demo of the last release. Next step would be to do this comparison on every code change.

I would like to focus on the commit level and not the release level. If a code change breaks tests, a fix or revert is required. A release is just a sum of commits. Our tests will catch these breaks, of course only if our test PDFs are sufficient.

Can I say that we have already reached the second-to-last step ("Next step would be to do this comparison on every code change.")?

> Next step would be to add more PDFs, add more complicated PDFs etc...

Adding more complex PDFs makes total sense! But what do we do if PDFParser can't parse them (correctly)? Ignore them or accept the failure in the tests? How can we make sure we improve code but don't break functionality then?

IMHO people involved in this project already know (most of its) flaws/bugs. But this project lacks enough people who are skilled enough to fix these bugs and have time to do it.

> Even if it's not obvious, I already try to help ;-)

English is not my mother tongue, it's German, so some sentences may come across differently than I intended. I welcome your input, that's why I stick to this discussion. My intention here is to outline the problems I see (IMHO).
