jpeddicord / askalono

A tool & library to detect open source licenses from texts
Apache License 2.0
255 stars 25 forks source link

Simplify `lcs_substr` and `remove_common_tokens` #46

Closed AnthonyMikh closed 4 years ago

AnthonyMikh commented 5 years ago

Description of changes: The code of lcs_substr and remove_common_tokens was changed in order to make them easier to understand (at least, I hope so). No attempt was made to address performance issues.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

jpeddicord commented 4 years ago

Yikes! I missed this, sorry. Feel free to poke me if things sit too long. I think this is a great change, though do note that the lcs bits are currently disabled due to a bug uncovered in their implementation. I can't tell if this fixes it, but I'm happy to merge this in any case as it greatly simplifies things.

Nice work!

AnthonyMikh commented 4 years ago

Thanks, it's nice to hear.

I want to note that I haven't make any attempt to fix the transformation, I've just kept the behaviour as is (tests run successfuly). Nevertheless, some parts seem suspicious for me:

  1. The original code of remove_common_tokens (as well as mine) goes through consecutive pairs of lines. They do not interleave, so calculated lcs_substr doesn't take some line relations into account.
  2. When the code makes the result, it looks at the length of largest substring (or should it be rather called "largest common prefix"?) and if it is large enough, function returns only the lines that starts with largest substring, otherwise the function unconditionally returns all the lines of text.

I am not sure if these moments have something to do with #42, but they seem like the opportunity to make an error. Feel free to prove me wrong.