togethercomputer / RedPajama-Data

The RedPajama-Data repository contains code for preparing large datasets for training large language models.
Apache License 2.0

fix clean copyright #29

Open hust-nj opened 1 year ago

hust-nj commented 1 year ago

I think there are two main problems in the current `clean_copyright_comments` function: https://github.com/togethercomputer/RedPajama-Data/blob/567ac9a0927c6dd3a2bf7e880de191239acfc308/data_prep/github/github_clean_dedup_local.py#L27.

First, it fails to remove the copyright comment from the following C-style code because of the early return at https://github.com/togethercomputer/RedPajama-Data/blob/567ac9a0927c6dd3a2bf7e880de191239acfc308/data_prep/github/github_clean_dedup_local.py#L37:

```C
// Copyright

int main() {
    return 0;

    /* comment */
}
```

Second, in my experiments the regex is sometimes very slow on large files. I think we only need to search for the copyright notice in the first 100 lines.
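To make both points concrete, here is a minimal sketch of what such a fix could look like. It is not the repository's code: the helper name `strip_copyright_header`, the `max_lines` parameter, the block-comment regex, and the list of line-comment prefixes are all assumptions made for illustration.

```python
import re

# Assumed pattern for a single C-style block comment: /* ... */
BLOCK_COMMENT = re.compile(r"/\*[^*]*\*+(?:[^/*][^*]*\*+)*/")
# Assumed set of line-comment prefixes to treat as a header.
LINE_COMMENT_PREFIXES = ("//", "#", "--")


def strip_copyright_header(content: str, max_lines: int = 100) -> str:
    """Hypothetical helper: look for copyright comments only in the
    first `max_lines` lines and leave the rest of the file untouched."""
    lines = content.split("\n")
    head = "\n".join(lines[:max_lines])
    tail = lines[max_lines:]

    # Step 1: remove the first block comment in the head, but only if
    # it mentions "copyright".
    match = BLOCK_COMMENT.search(head)
    if match and "copyright" in match.group(0).lower():
        head = head[:match.start()] + head[match.end():]
    # No early return here, so the line-comment check below always runs.

    # Step 2: drop leading line comments and blank lines.
    head_lines = head.split("\n")
    skip = 0
    for line in head_lines:
        if not line or line.startswith(LINE_COMMENT_PREFIXES):
            skip += 1
        else:
            break

    return "\n".join(head_lines[skip:] + tail)
```

With this sketch, the C example above loses its leading `// Copyright` line while the unrelated `/* comment */` inside `main` is kept. One trade-off: a block comment that straddles the `max_lines` boundary is not matched at all, which seems acceptable for copyright headers at the top of a file.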

mauriceweber commented 1 year ago

Hi @hust-nj! Thanks for bringing this to our attention! I will review your PR asap.

mauriceweber commented 1 year ago

Hi @hust-nj, I had a look at your PR. Here's some feedback:

```C
int main() { return 0;

/* comment */

}
```

yields
```C
int main() {
    return 0;

    /* comment */
}