sailuh / kaiaulu

An R package for mining software repositories
http://itm0.shidler.hawaii.edu/kaiaulu
Mozilla Public License 2.0
20 stars 13 forks source link

i #238 fake mbox generator #250

Closed Rubegen closed 1 year ago

Rubegen commented 1 year ago

Rough start to creating fake mbox data

carlosparadis commented 1 year ago

@Rubegen i see an empty file for your PR here, did you forget to commit or not ready yet? Please remember to click request review button. Also if needed we can try making an extra call, but I'd prefer to keep on text unless strictly necessary.

Rubegen commented 1 year ago

Hi Carlos, thanks for pointing this out. That was my mistake for committing the empty file. I went ahead and did another commit just now to show what I have so far to give you an idea of what I've been doing.

codecov[bot] commented 1 year ago

Codecov Report

All modified and coverable lines are covered by tests :white_check_mark:

Comparison is base (67ff85f) 18.62% compared to head (e558260) 20.32%.

Additional details and impacted files ```diff @@ Coverage Diff @@ ## master #250 +/- ## ========================================== + Coverage 18.62% 20.32% +1.70% ========================================== Files 20 21 +1 Lines 2572 2627 +55 ========================================== + Hits 479 534 +55 Misses 2093 2093 ```

:umbrella: View full report in Codecov by Sentry.
:loudspeaker: Have feedback on the report? Share it here.

carlosparadis commented 1 year ago

Thanks! This is on point 👍 Feel free to ping when you move on to example.R and test-mail.R with some ideas for examples.

carlosparadis commented 1 year ago

@Rubegen Just to be clear: It is fine you only have one unit test, but I do need you to polish the actual function parameters as asked above, otherwise its utility will be severely limited on the number of unit tests we can make. Does that work?

carlosparadis commented 1 year ago

@Rubegen

Your check should pass if you re-compile the documentation and version the generated .Rd for the new function.

However, besides the inline details of your current function which are nearly done, there is one more bigger issue here: https://github.com/sailuh/kaiaulu/pull/250/files#r1398279452

I.e. your example creates one e-mail and one e-mail alone. It is not clear to me how to create multiple e-mail threads, each with different people replying to them. I think you can do with the code you already have, it just needs a bit more thinking on how to make your function call API for the mbox.

Right now, you have a function that doesn't really create_fake_mbox, but rather create_fake_reply() (and if the reply is the first email on that subject, then it happens to be the start of the thread) and returns it as a string. What we need then, is another function that we can pass a list of these replies, and saves them as a .mbox file (because a single .mbox can contain multiple e-mail threads and replies :) ). So I think you're almost there!

Your example function then would call three times create_fake_reply() with say, three e-mails, where two have the same subject, and one has a different subject, assign each e-mail to a variable:


replies <- list()

replies[[1]] <- create_fake_reply(subject="Are we ready for the next release?",...)
replies[[2]] <- create_fake_reply(subject="Re: Are we ready for the next release?",...)
replies[[3]] <- create_fake_reply(subject="Hi, I am new to this mailing list!",...)

create_fake_mbox(replies,save_path="/tmp/example_2_threads_1_reply.mbox")

Note this is very consistent to how we do things on git fake data generator: We never ask the user to do system2 calls on our behalf. Users use Kaiaulu functions to generate the raw data, and these functions hide the details.

Edit: fixed code block above.

carlosparadis commented 1 year ago

If you read the prior message on an e-mail, please see on GitHub as I made a few edits and the e-mails are not sent again.

carlosparadis commented 1 year ago

By the way, don't worry about unit test all 3 replies: Your unit test of one reply is fine. I can write those on my own later. But I do need the API to be able to simulate mailing lists, otherwise we can't unit test much with it.

For instance, the example I gave you on git to test Codeface behavior, was only possible because I could simulate multiple people changing different files and at what time order. If all I could do was one commit and one file, while I could test parse_gitlog() works for 1 commit and 1 person, I would not be able to test transform_temporal_graph(), which creates a graph of multiple people and commits. Regardless, we want to be sure the parser can parse more than one e-mail anyway.

Let me know if anything is not clear.

carlosparadis commented 1 year ago

Your errors should go away once you update git by deleting .Rd for functions that no longer exist and/or add to those of the new functions:

❯ checking for code/documentation mismatches ... WARNING Functions or methods with usage in documentation object 'create_fake_mbox' but not in code: ‘create_fake_mbox’

Functions or methods with usage in documentation object 'example_mbox_normal' but not in code: ‘example_mbox_normal’

carlosparadis commented 1 year ago

@Rubegen After you remove the file() and close() calls, confirm to me you ran this against parse_mbox() and it works. If it does, please hit the "request" review button and I will take from there 👍

I just realized we are not using the folder_name parameter of your example function nor creating a folder (and subsequently deleting), but I can handle that part, no worries.

Rubegen commented 1 year ago

@Rubegen After you remove the file() and close() calls, confirm to me you ran this against parse_mbox() and it works. If it does, please hit the "request" review button and I will take from there 👍

Hi Carlos, just confirming here that I ran all my final functions against parse_mbox() and it worked. The data got parsed into a data table and I was able to view all the replies in the mbox. I'll request review now