Started reading through your paper, specifically "Step 4: Preprocess big data post set", and I believe the model we are using does similar preprocessing of the data.
From the paper you first sent me, it seems they only tokenize certain text strings from the comments based on the rules below:
In keeping with these rules, I've also added a comment preprocessor function to address rules 1, 3, and 4: https://github.com/ponder-lab/GitHub-Issue-Mining/blob/main/utils/commentProcessor.py#L42
For rule 2, I'm still looking into exactly how previously quoted comments are identified in the comment string. I'll make a note to open an issue for this.
UPDATE: Added this as a comment on the QA/testing issue thread: https://github.com/ponder-lab/GitHub-Issue-Mining/issues/14
To be clear, I am only wondering whether the model you are using does this. I am not (yet) suggesting that we do this. In fact, I don't know if it's necessary for GitHub. It seems necessary for Stack Overflow, but GitHub has a different structure from what I can tell.
For rule 2, a quote would be a sequence of lines starting with `>` and ending in two newlines. Like this:

> I think that this is a quote
> That continues.

> And this
> Is also
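A minimal sketch of how that rule could be implemented, assuming a regex-based approach (the function name `strip_quoted_text` is hypothetical and is not the actual `commentProcessor.py` code):

```python
import re

def strip_quoted_text(comment: str) -> str:
    """Remove quoted replies per rule 2: a run of consecutive lines
    starting with '>' that ends at a blank line (two newlines) or at
    the end of the comment. Hypothetical sketch, not the repo's code."""
    # (?m) makes ^ match at every line start; the group consumes one or
    # more '>'-prefixed lines, and the trailing \n? also drops the blank
    # line that closes the quote block.
    quote_block = re.compile(r"(?m)^(?:>[^\n]*\n?)+\n?")
    return quote_block.sub("", comment)


comment = (
    "> I think that this is a quote\n"
    "> That continues.\n"
    "\n"
    "My actual reply."
)
print(strip_quoted_text(comment))  # -> "My actual reply."
```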
Do you know if the model you are using does any sanitization of the comments? For clarification, check out step 4 in section 2 of this paper.