jpeddicord / askalono

A tool & library to detect open source licenses from texts
Apache License 2.0

WASM demo diff drops character(s)? #62

Open jamin-aws-ospo opened 3 years ago

jamin-aws-ospo commented 3 years ago

In the following diff, the 4s in 2014 (on the first and second copyright lines) are missing from the diff rendering on the right:

image

jpeddicord commented 3 years ago

Ah! This one is interesting. It's actually not quite a bug, but it is an oddity (and I'd be open to fixes). A couple things going on:

  1. Internally, askalono tries its best to erase copyright statements so that they don't affect its internal comparison.
  2. The diff you're seeing online isn't quite representative of the diff that askalono has internally.

Putting the second item another way, there are three representations of text here:

  A. What you pasted.
  B. The actual text of the license.
  C. askalono's post-processed form of the license text.

Internally, it's using texts B and C. But the website shows a diff between A and B. This was semi-intentional, as form C is quite ugly to look at: it has no newlines, no special characters, and so on. It's just a long space-separated string of lowercase words.
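
To make form C concrete, here is a rough sketch of that kind of normalization. It is written in Python for brevity (askalono itself is Rust), and the `normalize` function and its regex are illustrative assumptions, not the library's actual pipeline, which has more steps:

```python
import re

# Hypothetical pattern: a whole line that starts with a copyright marker.
# The real tool's copyright handling is more involved than this.
COPYRIGHT_LINE = re.compile(
    r"^[ \t]*(copyright\b|\(c\)|©).*$",
    re.IGNORECASE | re.MULTILINE,
)


def normalize(text: str) -> str:
    """Reduce text to a single space-separated string of lowercase words."""
    # 1. Erase copyright statements so they don't affect the comparison.
    without_copyright = COPYRIGHT_LINE.sub("", text)
    # 2. Lowercase everything.
    lowered = without_copyright.lower()
    # 3. Turn newlines, punctuation, and other special characters into
    #    single spaces, leaving only words.
    words_only = re.sub(r"[^a-z0-9]+", " ", lowered)
    return words_only.strip()


sample = (
    "MIT License\n\n"
    "Copyright (c) 2014 Example Author\n\n"
    "Permission is hereby granted, free of charge"
)
print(normalize(sample))
```

The output has no line breaks and no copyright line left, which is why a diff against this form would be hard to map back onto what was pasted.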

There's potentially some merit to running some text pre-processing and sending that to the web UI to diff with instead, before the more "destructive" text processors get to it. That functionality doesn't exist, but I'm open to the idea of it.
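
As a sketch of that idea only (nothing like this exists in askalono today), a gentler, display-safe step could tidy whitespace while keeping case, punctuation, and line breaks, and the UI could diff that output instead. The names `gentle_preprocess` and `display_diff` are hypothetical, and Python's standard `difflib` stands in for whatever the web UI would actually use:

```python
import difflib
import re


def gentle_preprocess(text: str) -> str:
    """Hypothetical display-safe step: collapse runs of spaces and tabs
    and trim each line, but keep case, punctuation, and line breaks."""
    lines = [re.sub(r"[ \t]+", " ", line).strip() for line in text.splitlines()]
    return "\n".join(lines).strip()


def display_diff(pasted: str, license_text: str) -> list[str]:
    """Diff the gently pre-processed texts line by line for display."""
    a = gentle_preprocess(pasted).splitlines()
    b = gentle_preprocess(license_text).splitlines()
    return list(
        difflib.unified_diff(a, b, fromfile="pasted", tofile="license", lineterm="")
    )


pasted = "Copyright (c)   2014 Example Author  \nPermission is hereby granted"
reference = "Copyright (c) <year> <copyright holders>\nPermission is hereby granted"

for line in display_diff(pasted, reference):
    print(line)
```

Because this step runs before anything destructive, the diff stays readable and every character the user pasted is still accounted for.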