togethercomputer / RedPajama-Data

The RedPajama-Data repository contains code for preparing large datasets for training large language models.

Apache License 2.0

fix clean copyright #29

Open hust-nj opened 1 year ago

hust-nj commented 1 year ago

I think there are two main problems in the current `clean_copyright_comments` function: https://github.com/togethercomputer/RedPajama-Data/blob/567ac9a0927c6dd3a2bf7e880de191239acfc308/data_prep/github/github_clean_dedup_local.py#L27.

First, it fails to remove the copyright comment in the following C-style code because of the early return at https://github.com/togethercomputer/RedPajama-Data/blob/567ac9a0927c6dd3a2bf7e880de191239acfc308/data_prep/github/github_clean_dedup_local.py#L37:

```C
// Copyright

int main() {
    return 0;

    /* comment */
}
```
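To make the expected behavior concrete, here is a minimal Python sketch. The helper `strip_leading_copyright_lines` is hypothetical and is not the repository's function. It assumes the header is a leading run of `//` lines, and it removes that run only when it mentions "copyright", independent of any block comment that appears later in the file.

```python
def strip_leading_copyright_lines(content: str) -> str:
    """Drop a leading run of `//` lines (and blank lines) if it mentions copyright.

    Hypothetical helper for illustration; not the repository's implementation.
    """
    lines = content.split("\n")
    end = 0
    # Advance past the leading run of `//` comments and blank lines.
    while end < len(lines) and (lines[end].startswith("//") or not lines[end].strip()):
        end += 1
    header = "\n".join(lines[:end])
    if "copyright" in header.lower():
        return "\n".join(lines[end:])
    # Leading comments that do not mention copyright are left untouched.
    return content
```

With the C snippet above as input, this returns the file starting at `int main() {`, and the later `/* comment */` is left in place.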

Second, when the file is large, the regex sometimes takes a long time in my experiments. I think we only need to search for the copyright in the first 100 lines.
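A rough sketch of what limiting the search could look like, in Python. The names `BLOCK_COMMENT` and `strip_copyright_block` and the simplified pattern are assumptions for illustration, not the code in the repository; `max_lines=100` is just the value proposed here.

```python
import re

# Simplified, hypothetical pattern for C-style block comments.
BLOCK_COMMENT = re.compile(r"/\*.*?\*/", re.DOTALL)


def strip_copyright_block(content: str, max_lines: int = 100) -> str:
    """Remove the first block comment mentioning 'copyright' within the first max_lines lines."""
    lines = content.split("\n")
    head = "\n".join(lines[:max_lines])  # only this part is scanned by the regex
    tail = lines[max_lines:]             # the rest of the file is never scanned
    for match in BLOCK_COMMENT.finditer(head):
        if "copyright" in match.group(0).lower():
            head = head[:match.start()] + head[match.end():]
            break
    return "\n".join([head] + tail) if tail else head
```

The regex cost is then bounded by the size of the first `max_lines` lines rather than the whole file. The trade-off is that a copyright comment located after that cutoff, or one that straddles it, is not removed.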

mauriceweber commented 1 year ago

Hi @hust-nj ! Thanks for bringing this to our attention! I will review your PR asap.

mauriceweber commented 1 year ago

Hi @hust-nj , I had a look at your PR. Here's some feedback:

I would prefer not to limit the search for copyright to the first 100 lines; what is the basis for choosing 100 lines?
Your current implementation also removes comments at the beginning of any file, which we would like to keep. For example, this:
```C
// A comment

int main() {
    return 0;

    /* comment */
}
```

yields
```C
int main() {
    return 0;

    /* comment */
}