ponder-lab / GitHub-Issue-Classifier

Python script to mine for GitHub issues + comments and classify them.
MIT License

Sanitization quotes in issue comments #22

Closed khatchad closed 3 years ago

khatchad commented 3 years ago

Do you know if the model you are using does any sanitization of the comments? For clarification, check out step 4 in section 2 of this paper.

y3pio commented 3 years ago

Started reading through your paper, specifically "Step 4: Preprocess big data post set", and I believe the model that we are using does similar pre-processing of data.

From the paper that you first sent me, it seems they are only tokenizing certain text strings from the comments based on the rules below:

  1. Identify embedded source code blocks and replace with CODE token.
  2. Identify quotations made in previous comments and replace with QUOTE token.
  3. Identify reference links to external resources and replace with URL token.
  4. Identify mentions to GitHub users and replace with SCREEN NAME token.

In keeping with these rules, I've also added a comment preprocessor function to address rules 1, 3, and 4. https://github.com/ponder-lab/GitHub-Issue-Mining/blob/main/utils/commentProcessor.py#L42
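For readers without access to the linked file, here is a minimal sketch of what rules 1, 3, and 4 might look like as regular-expression substitutions. It is not the code in commentProcessor.py: the function name `replace_tokens`, the patterns, and the underscore spelling `SCREEN_NAME` are all assumptions made for illustration.

```python
import re

# Hypothetical patterns for illustration; the real preprocessor may differ.
FENCED_CODE = re.compile(r"```.*?```", re.DOTALL)   # rule 1: fenced code blocks
INLINE_CODE = re.compile(r"`[^`\n]+`")              # rule 1: inline code spans
URL = re.compile(r"https?://\S+")                   # rule 3: http(s) links
# Rule 4: "@name" not preceded by a word character or slash,
# so email addresses such as a@b.com are left alone.
MENTION = re.compile(r"(?<![\w/])@[A-Za-z0-9][A-Za-z0-9-]{0,38}")


def replace_tokens(comment: str) -> str:
    """Replace code, links, and user mentions with placeholder tokens."""
    # Code goes first so that URLs or mentions inside code are not tokenized separately.
    comment = FENCED_CODE.sub("CODE", comment)
    comment = INLINE_CODE.sub("CODE", comment)
    comment = URL.sub("URL", comment)
    comment = MENTION.sub("SCREEN_NAME", comment)
    return comment


print(replace_tokens("Thanks @y3pio, see https://example.com/a and run `pip install x`"))
```

The URL pattern is deliberately naive: it consumes everything up to the next whitespace, so trailing punctuation attached to a link is swallowed along with it.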

For rule 2, I'm still looking into how exactly previously quoted comments are identified in the comment string. Will make a note to open an issue for this.

UPDATE: Added this as a comment on the QA/testing issue thread: https://github.com/ponder-lab/GitHub-Issue-Mining/issues/14

khatchad commented 3 years ago

To be clear, I am only wondering whether the model you are using does this. I am not (yet) suggesting that we do this. In fact, I don't know if it's necessary for GitHub. It seems necessary for Stack Overflow, but GitHub has a different structure from what I can tell.

khatchad commented 3 years ago

> I think that this is a quote
> That continues.

khatchad commented 3 years ago

> And this
> Is also

khatchad commented 3 years ago

Like this

khatchad commented 3 years ago

For rule 2, a quote would be a sequence of lines starting with > and ending in two newlines.
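A minimal sketch of that rule, assuming a quote is any run of consecutive lines whose first non-space character is `>` (up to three leading spaces allowed), and that the run ends at the first line that does not start with `>`, such as the blank line. The function name `replace_quotes` and the pattern are hypothetical, not taken from the repository.

```python
import re

# One or more consecutive lines that begin with ">".
# Each line is consumed up to and including its newline (or the end of the string).
QUOTE_BLOCK = re.compile(r"(?:^ {0,3}>.*(?:\n|$))+", re.MULTILINE)


def replace_quotes(comment: str) -> str:
    """Replace each block of '>'-prefixed lines with a single QUOTE token."""
    def _token(match: re.Match) -> str:
        # Keep the line break that ended the block, if there was one.
        return "QUOTE\n" if match.group().endswith("\n") else "QUOTE"

    return QUOTE_BLOCK.sub(_token, comment)


text = "> I think that this is a quote\n> That continues.\n\nMy reply."
print(replace_quotes(text))
```

Lines inside a fenced code block that happen to start with `>` would also match, so code would need to be replaced before quotes. Markdown's lazy continuation lines (quote text that wraps without a leading `>`) are not handled by this sketch.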