jpeddicord / askalono

A tool & library to detect open source licenses from texts
Apache License 2.0
256 stars 25 forks

Implemented largest substring removal #32

Closed JoshBrudnak closed 5 years ago

JoshBrudnak commented 5 years ago

Resolves #3

Added a step to the preprocessing to remove the largest substring from a license. I also added testing of the function.

JoshBrudnak commented 5 years ago

@jpeddicord It is able to find the largest substring and remove it in all occurrences except the first. Here's an example:

Before largest substr removal:

%%Copyright: Copyright                                                           
%%Copyright: All rights reserved.                                                        
%%Copyright: Redistribution and use in source and binary forms, with or                  
%%Copyright: without modification, are permitted provided that the                       
%%Copyright: following conditions are met:

After:

%%Copyright: Copyright
All rights reserved.
Redistribution and use in source and binary forms, with or
without modification, are permitted provided that the
following conditions are met:

Do you know what might be causing this?

jpeddicord commented 5 years ago

Thanks for the PR! I forgot to take a look at this yesterday; did you figure that out in your latest commit?

Looks like a test is failing:

https://travis-ci.org/amzn/askalono/jobs/440452158#L694

Not entirely sure why, but I'll take a look.

jpeddicord commented 5 years ago

Also: can you confirm that your changes are being submitted under the terms of the Apache 2.0 license?

JoshBrudnak commented 5 years ago

Yes, I simply forgot to trim the string before the comparison :face_with_head_bandage: . I did make the changes under the terms of the Apache 2.0 license.
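(For anyone reading along: the exact comparison in the PR isn't reproduced here, but a minimal standalone sketch of why trimming matters for the first line of the example above might look like this. The literal strings are illustrative, not taken from the test suite.)

```rust
fn main() {
    // The raw PostScript-style input carries trailing whitespace on each
    // line. If lines are compared untrimmed, the shared "%%Copyright: "
    // prefix doesn't match uniformly, so the first occurrence survives.
    let raw_first = "%%Copyright: Copyright                 ";
    let raw_second = "%%Copyright: All rights reserved.      ";

    // Untrimmed, the two lines differ beyond the prefix length:
    assert_ne!(raw_first, raw_first.trim());

    // After trimming, both lines expose the same prefix and can be
    // handled uniformly by the removal step:
    assert!(raw_first.trim().starts_with("%%Copyright: "));
    assert!(raw_second.trim().starts_with("%%Copyright: "));
    println!("ok");
}
```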

jpeddicord commented 5 years ago

Sorry for my slow response; I have not forgotten about this! I'm digging into the failing test now.

jpeddicord commented 5 years ago

OK. I'm not 100% sure why that test specifically fails in the way it does, but I think it can be traced back to a missing case in this greatest substring removal: the code will remove the longest substring found anywhere on a given line, which is likely too aggressive. Ideally it'd only remove substrings from the beginning of lines (and only look for the longest substrings starting at the beginning of a line).

Here are some asserts you can add to your test case to show what I mean. To see the printlns, run your test with cargo test -- --nocapture:

        let text = "this string should still have\nthis word -> this <- in it even though\nthis is still the most common word";
        let new_text = remove_common_tokens(text);
        println!("-- {}", new_text);
        // the "this" at the start of the line can be discarded...
        assert!(!new_text.contains("\nthis"));
        // ...but the "this" in the middle of sentences shouldn't be
        assert!(new_text.contains("this"));

        let text = "aaaa bbbb cccc dddd
        eeee ffff aaaa gggg
        hhhh iiii jjjj";
        let new_text = remove_common_tokens(text);
        println!("-- {}", new_text);
        assert!(new_text.contains("aaaa")); // similar to above test

Another good test to add (I don't know if the code is doing this or not) is one to check that the removal of common prefixes only happens within a group of lines at a time. For example, if a file looked something like this:

//LICENSE this is the
//LICENSE license text of
//LICENSE this file
//LICENSE copyright me

fn my_code {
   [...]
   LICENSE = 123;
}

The "LICENSE" later on in the file shouldn't get removed unless it was in a group of other "LICENSE"-prefixed lines as well. I think your code accounts for this by only comparing two lines at a time, but it'd be good to have a test to make this solid as well.
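(Editor's sketch, not askalono's actual code: one way to implement the prefix-only removal described above. The function name `remove_common_prefix` and the token-boundary heuristic are assumptions for illustration; the real `remove_common_tokens` works differently.)

```rust
/// Strip the longest prefix shared by every line in a block, trimming
/// lines first so stray whitespace doesn't defeat the comparison.
/// Only cuts at a token (space) boundary to avoid splitting words.
fn remove_common_prefix(text: &str) -> String {
    let lines: Vec<&str> = text.lines().map(|l| l.trim()).collect();
    if lines.len() < 2 {
        return text.to_string();
    }

    // Length of the prefix common to all lines, measured against the first.
    let first = lines[0];
    let mut prefix_len = first.len();
    for line in &lines[1..] {
        let common = first
            .bytes()
            .zip(line.bytes())
            .take_while(|&(a, b)| a == b)
            .count();
        prefix_len = prefix_len.min(common);
    }

    // Back off to the last space so we never cut a word in half; if the
    // shared prefix contains no full token, leave the text alone.
    let prefix = match first[..prefix_len].rfind(' ') {
        Some(i) => &first[..=i],
        None => return text.to_string(),
    };

    lines
        .iter()
        .map(|&l| l.strip_prefix(prefix).unwrap_or(l))
        .collect::<Vec<_>>()
        .join("\n")
}

fn main() {
    let text = "%%Copyright: Copyright\n\
                %%Copyright: All rights reserved.\n\
                %%Copyright: Redistribution and use";
    let out = remove_common_prefix(text);
    println!("{}", out);
    // Every line, including the first, has the prefix stripped:
    assert!(!out.contains("%%Copyright:"));
    assert!(out.contains("All rights reserved."));
}
```

Because it only ever matches from the start of each line, a token like `LICENSE` appearing mid-line elsewhere in the file is untouched, which is the behavior the test above is meant to pin down.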

JoshBrudnak commented 5 years ago

I refactored the LCS to remove common substrings only if they appear at the beginning of a line. This seems to have fixed the failing test. I also added the extra tests that you provided. Thanks :100: .

jpeddicord commented 5 years ago

This looks great! Thanks for fixing that up. I might tinker with the order of the substring removal down the line (it might be more efficient to put it in the "normalize" list instead of "aggressive", so it'll get more heavily cached) but as it is this'll work just fine.