sailuh / kaiaulu

An R package for mining software repositories
http://itm0.shidler.hawaii.edu/kaiaulu
Mozilla Public License 2.0

Perceval version 0.21.2-rc.4 returns error on parse_mbox(), perceval_parsed is empty in Social Smells Notebook #145

Closed leilani-reich closed 1 year ago

leilani-reich commented 1 year ago

Problem:

While running social_smell_showcase.Rmd on the Apache Thrift github repository (https://github.com/apache/thrift), on line 139

https://github.com/sailuh/kaiaulu/blob/f66bc0f964861b8c928cda01c6b4b493414bf13a/vignettes/social_smell_showcase.Rmd#L139

I got the following error:

Error in data.table::setnames(perceval_parsed, "data.body", "reply_body") : Items of 'old' not found in column names: [data.body]. Consider skip_absent=TRUE.

Steps I took to try and fix it:

- Checked my perceval_path and mbox_path and made sure they were correct and relative to my working directory
- Checked my perceval installation & tried different installation methods
- Used a different mbox file. First I tried the one provided for the Apache Thrift project by Rick. Then, I used the mbox here https://lists.apache.org/list.html?dev@thrift.apache.org.
- Tried running the notebook on another project. In particular, I tried to run the notebook for the Apache Helix github repo (https://github.com/apache/helix), but that didn't seem to make a difference.

My findings:

I looked at the source of the parse_mbox() function in parser.R, and I tried to learn more about the error.

https://github.com/sailuh/kaiaulu/blob/f66bc0f964861b8c928cda01c6b4b493414bf13a/R/parser.R#L243-L281

One thing I noticed was that if I set stderr=TRUE in line 250-253 of this function

perceval_output <- system2(perceval_path, args = c('mbox',mbox_uri,mbox_path,'--json-line'), stdout = TRUE, stderr = TRUE) # set to true

Then I get the following error message:

Error: parse error: after array element, I expect ',' or ']'
[2023-01-26 22:19:07,492] - Sir Perce
(right here) ------^

I know the "[2023-01-26 22:19:07,492]" seems to refer to the time I ran the notebook. I believe "Sir Perce" refers to perceval, but I don't understand why perceval is giving this error.

Also if I set verbose=TRUE in line 255 of this function:

line 255) perceval_parsed <- data.table(jsonlite::stream_in(textConnection(perceval_output),verbose=TRUE)) # changed verbose to TRUE

The console output says "Imported 0 records. Simplifying...".

So perceval_parsed is an empty table, which I believe explains why I was getting this error from before:

Error in data.table::setnames(perceval_parsed, "data.body", "reply_body") : Items of 'old' not found in column names: [data.body]. Consider skip_absent=TRUE.

Referencing this documentation for the setnames() function, https://rdrr.io/rforge/CALIBERdatamanage/man/setnames.html, I see that "old" holds the existing column names to be renamed, and since the table was empty, the "old" name data.body wasn't found.

The problem I have is I don't understand why perceval_parsed ends up being empty.
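The failure mode described above can be reproduced in isolation. The following is a minimal sketch, assuming a hypothetical table with no rows in place of the real Perceval result; `check_parsed()` is an illustrative helper and not part of Kaiaulu.

```r
library(data.table)

# Hypothetical stand-in for a parse that produced no usable records:
# zero rows and no `data.body` column.
perceval_parsed <- data.table(placeholder = character())

# Renaming a column that is not there raises the error quoted in this issue.
rename_error <- tryCatch(
  {
    setnames(perceval_parsed, "data.body", "reply_body")
    NA_character_
  },
  error = function(e) conditionMessage(e)
)
print(rename_error)

# Illustrative guard (not Kaiaulu code): fail early with a clearer message.
check_parsed <- function(dt, required = "data.body") {
  missing_cols <- setdiff(required, names(dt))
  if (nrow(dt) == 0 || length(missing_cols) > 0) {
    stop("Perceval output is empty or lacks columns: ",
         paste(missing_cols, collapse = ", "))
  }
  invisible(dt)
}
```

A check like this would point at the empty Perceval output directly, instead of surfacing later as a rename failure.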

To replicate error:

- clone Kaiaulu and install Kaiaulu package
- clone Apache Thrift repo (https://github.com/apache/thrift)
- set up the project configuration file for thrift.yml following the example here (https://github.com/sailuh/kaiaulu/blob/master/conf/helix.yml). Specifically, I just set up the version control and mailing list section
- Within social_smell_showcase.Rmd, change tools_path and conf_path (here: https://github.com/sailuh/kaiaulu/blob/f66bc0f964861b8c928cda01c6b4b493414bf13a/vignettes/social_smell_showcase.Rmd#L58-L59)
- run the social smells notebook

carlosparadis commented 1 year ago

Hi Leilani,

Could you try running the command for Perceval directly on terminal? i.e.

perceval_output <- system2(perceval_path,
args = c('mbox',mbox_uri,mbox_path,'--json-line'),
stdout = TRUE,
stderr = TRUE) # set to true

Should be on terminal:

/path/to/perceval mbox <mbox_uri> <mbox_path> --json-line

Where <mbox_uri> and <mbox_path> are the values of the variables, which you would just paste directly into the terminal.

What this code line is doing is really just pasting it on the terminal on your behalf, and then ingesting the json back into R. If on the terminal you also get an empty table, then the issue lies within Perceval itself or the data, instead of Kaiaulu. This will also help us see if Perceval outputs any other error that may help us diagnose it.
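To get the exact terminal command from the R session, the arguments passed to system2 can be joined into one string. This is a minimal sketch; the three values below are hypothetical placeholders and should be replaced with the ones loaded from your own configuration files.

```r
# Hypothetical values; substitute the ones from your tools.yml and project config
perceval_path <- "/path/to/perceval"
mbox_uri <- "../rawdata/mbox/thrift-dev"
mbox_path <- "../rawdata/mbox/thrift-dev.mbox"

# Same argument vector the system2() call in this thread uses
perceval_args <- c("mbox", mbox_uri, mbox_path, "--json-line")

# Equivalent terminal command, ready to paste into a shell
terminal_command <- paste(c(perceval_path, perceval_args), collapse = " ")
cat(terminal_command, "\n")
```

Paths that contain spaces would need quoting (for example with `shQuote()`) before pasting.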

carlosparadis commented 1 year ago

Also, could you attach the thrift config file you made here so I could check?

leilani-reich commented 1 year ago

Hi Carlos,

I seem to be getting an empty table again. I believe it's an error with Perceval because I tried the command on different mbox files, but I get the same issue.

This is the issue:

(my_env) (base) leilanis-mbp-5:kaiaulu xylent1$ ~/github/ics496/kaiaulu/my_env/bin/perceval mbox ../thrift-project/thrift-dev ../thrift-project/thrift-dev.mbox --json-line
[2023-01-27 21:31:41,849] - Sir Perceval is on his quest.thrift
[2023-01-27 21:31:41,851] - Looking for messages from '../thrift-project/thrift-dev' on '../thrift-project/thrift-dev.mbox' since 1970-01-01 00:00:00+00:00 until None
[2023-01-27 21:31:41,851] - Error!: <<class 'NoneType'>> object is not a valid date
[2023-01-27 21:31:41,851] - Sir Perceval completed his quest.

I've attached my thrift project configuration file. I copied the helix.yml file in Kaiaulu and made small changes for thrift.yml. In particular, I have only changed the version_control and mailing_list sections so far.

Is there a certain version of Perceval I need for Kaiaulu? The last one I tried was 0.21.2-rc.4, the newest version.

Note: Attached file is a .txt instead of .yml because .yml wasn't allowed.

thrift.txt

Also, here's a link to download the mbox I used: https://cdn.lfdr.de/stmc/ieee_tse_data/mail/thrift-dev.mbox

carlosparadis commented 1 year ago

My perceval version is 0.12.24 if you would like to try and see if that is the issue.

Try to also just "cd" to the folder where your thrift-dev.mbox is, and try to run perceval there, e.g.:

cd /Users/cvp/Desktop/sailuh/rawdata/mbox
/Users/cvp/perceval/bin/perceval mbox thrift-dev thrift-dev.mbox --json-line

I'd recommend you follow the folder structure used in Kaiaulu, just to make it easier for us to reference each other's file path. More specifically, all Kaiaulu config files assume you have a folder structure of the following format:

/path/to/sailuh/kaiaulu
/path/to/sailuh/rawdata/git_repo/thrift.git
/path/to/sailuh/rawdata/mbox/thrift-dev.mbox

With this organization, the config file should specify the mbox path as:

../rawdata/mbox/thrift-dev.mbox

If you are running from the terminal or

../../rawdata/mbox/thrift-dev.mbox

If you are trying to compile the vignette.

Sometimes the relative path can be misleading when used in RStudio. If you compile the notebook, it will assume you are within /vignettes. If you execute from the terminal section, it usually assumes you are in the kaiaulu folder.
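A quick way to see how R resolves a relative path from wherever it is currently running is sketched below. The path is the one suggested above for compiling the vignette; whether the file exists depends on your own folder layout.

```r
# Path as it would be written in the project configuration file
mbox_path <- "../../rawdata/mbox/thrift-dev.mbox"

# Directory that relative paths are resolved against
print(getwd())

# Expand to an absolute path without raising an error if the file is absent
resolved_path <- normalizePath(mbox_path, mustWork = FALSE)
print(resolved_path)

# TRUE only if the file is reachable from the current working directory
mbox_found <- file.exists(mbox_path)
print(mbox_found)
```

Running this once in a notebook chunk and once in the console should show whether the two contexts start from different directories.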

Let me know if you still can't get it to run even running my version of Perceval or trying the above.

leilani-reich commented 1 year ago

Hi Carlos, I tried version 0.12.24 of perceval and that fixed the issue. I can see a filled table for project_mbox now. I will keep the folder structure advice in mind. Thank you!

carlosparadis commented 1 year ago

@leilani-reich Could you also open an issue on Perceval with the problem and link the issue here? It should suffice to provide them with the .mbox file URL, and the command you tried to execute on the terminal. It would be helpful to know from them what is causing the issue, otherwise in the future, we may be unable to maintain this integration with Perceval.

On our part, I will add the Perceval version to the README, so thank you for spotting this :)

leilani-reich commented 1 year ago

You're welcome! Issue is posted here: https://github.com/chaoss/grimoirelab-perceval/issues/810

leilani-reich commented 1 year ago

Hi Carlos, aside from using Perceval version 0.12.24 for Kaiaulu, we can implement the advice from the related grimoirelab-perceval issue into code within the parse_mbox() function directly so parse_mbox() will run smoothly for a newer version of Perceval.

In particular, we can change the following in parser.R (works with Perceval version 0.12.24 but not 0.21.3):

https://github.com/sailuh/kaiaulu/blob/f66bc0f964861b8c928cda01c6b4b493414bf13a/R/parser.R#L250-L253

To this (works for newest Perceval version 0.21.3 but doesn't work with 0.12.24):

  perceval_output <- system2(perceval_path,
                             args = c('mbox',mbox_uri,mbox_path,'--json-line','--to-date','2100-01-01'),
                             stdout = TRUE,
                             stderr = FALSE)

Is this something you think would be beneficial to change, or would it be best to just stick with Perceval version 0.12.24 for Kaiaulu?
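If both Perceval versions ever had to be supported, the extra flag could be made optional. This is only a sketch: `build_perceval_args()` and its `add_to_date` parameter are hypothetical and not part of Kaiaulu, and the caller would still have to decide which installed version needs the flag.

```r
# Illustrative helper (not part of Kaiaulu): build the Perceval arguments,
# optionally appending the --to-date workaround from this thread.
build_perceval_args <- function(mbox_uri, mbox_path, add_to_date = FALSE) {
  args <- c("mbox", mbox_uri, mbox_path, "--json-line")
  if (add_to_date) {
    args <- c(args, "--to-date", "2100-01-01")
  }
  args
}

# Arguments reported to work with 0.12.24
old_args <- build_perceval_args("thrift-dev", "thrift-dev.mbox")

# Arguments reported to work with 0.21.3
new_args <- build_perceval_args("thrift-dev", "thrift-dev.mbox", add_to_date = TRUE)
print(new_args)
```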

Thanks, Leilani

carlosparadis commented 1 year ago

Hi Leilani,

Since the solution provided was a temporary fix, I suggest we stick to 0.12.24 until the fix is implemented on their end.

leilani-reich commented 1 year ago

That makes sense. Thank you! I will look out for updates on this.

lh-zhan commented 1 year ago

Hi @carlosparadis ! I've followed all of the instructions related to this issue, made sure I have the 0.12.24 version, tested out perceval from the vignettes/ directory and it works just fine.

The command that I used to test it out:

@mac-groundapple vignettes % /Users/Desktop/kaiaulu/env/bin/perceval mbox dev_helix_apache_org ../../rawdata/mbox/dev_helix_apache_org.mbox

It ran without any error and gave me the desired result. I also made sure the perceval path specified in tools.yml matches the perceval path that I used above.

However, when I try to execute the .Rmd file, I get:

Error in data.table::setnames(perceval_parsed, c("data.Date", "data.To", : Items of 'old' not found in column names: [data.Cc, data.In.Reply.To]. Consider skip_absent=TRUE.

My setup in helix.yml is:

mbox: ../../rawdata/mbox/dev_helix_apache_org.mbox
domain: http://mail-archives.apache.org/mod_mbox
list_key:

I'm running the .Rmd file inside of RStudio. My file structure looks like this:

/Desktop/kaiaulu/vignettes
/Desktop/rawdata/mbox/dev_helix_apache_org.mbox

Any help is appreciated!

carlosparadis commented 1 year ago

@lh-zhan Hi Zhan! Thank you for the interest in the tool :)

This (misleading) error Kaiaulu generates is usually associated with the .mbox file not being found (similar to #108). Or it could also be you are trying to execute on an .mbox archive that doesn't have the columns Kaiaulu tries to rename.

I am assuming parse_mbox() is the function giving you trouble. To narrow down the possibilities, could you try running the parse_mbox() function inline?

In essence, execute the Notebook up to, but not including, the parse_mbox() call. In another tab, open the file R/parser.R and locate this function:

https://github.com/sailuh/kaiaulu/blob/716a46e546d5293f3b2856cace62311c95aae977/R/parser.R#L247

Then, load the parameters the function receives, and simply run the code in the function line by line until you encounter the error.

For example, check if the loaded object perceval_parsed contains a table or is an empty variable. If the variable is empty then it is a path problem.

Otherwise, your error likely lies in this region:

https://github.com/sailuh/kaiaulu/blob/716a46e546d5293f3b2856cace62311c95aae977/R/parser.R#L261-L273

You can then check, if for example, Kaiaulu is trying to rename columns that don't exist, by typing:

colnames(perceval_parsed)

to see if the columns in the data don't exist.
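The comparison can also be done programmatically. A minimal sketch, assuming a hypothetical parsed table that lacks two fields; the expected names are the ones that appear in the error message earlier in this thread.

```r
# Hypothetical parsed table that lacks some of the expected fields
perceval_parsed <- data.frame(
  data.Date = "Wed, 29 Mar 2023 08:42:18 -0400",
  data.To = "board@apache.org",
  stringsAsFactors = FALSE
)

# Columns named in the error message quoted in this thread
expected_columns <- c("data.Date", "data.To", "data.Cc", "data.In.Reply.To")

# Expected columns that are absent from the parsed data
missing_columns <- setdiff(expected_columns, colnames(perceval_parsed))
print(missing_columns)
```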

Let me know if this helps narrow down the issue, otherwise we can iterate here. Thanks!

lh-zhan commented 1 year ago

Hi @carlosparadis

Thank you for your swift reply! I was able to find out that the perceval_parsed is empty by following your instruction. You mentioned empty variable is caused by path issue, therefore, I tried modifying my path to mbox but no luck..

Paths that I've tried:

~/Users/lzhan/Desktop/rawdata/mbox/dev_helix_apache_org.mbox
../rawdata/mbox/dev_helix_apache_org.mbox
../../rawdata/mbox/dev_helix_apache_org.mbox

Can this issue be potentially caused by domain or list_key? I have:

domain: https://lists.apache.org/list.html?dev@helix.apache.org
list_key:

Thanks!!

carlosparadis commented 1 year ago

@lh-zhan Fantastic! Glad we could narrow down the problem.

So our issue is narrowed down to:

https://github.com/sailuh/kaiaulu/blob/716a46e546d5293f3b2856cace62311c95aae977/R/parser.R#L254-L257

Could you try using the full path, starting from root, for mbox_path? Technically path.expand()

https://github.com/sailuh/kaiaulu/blob/716a46e546d5293f3b2856cace62311c95aae977/R/parser.R#L249-L250

should be handling that, but trying manually would be interesting to see.

system2 is a base R function, so what we need to figure out is why R is unable to find the file, since it apparently finds the perceval path from tools.yml on your computer but not the mbox file.

Could you also experiment with setting stderr to TRUE? (e.g. try stdout = FALSE and stderr = TRUE to see if it prints the terminal error).

https://github.com/sailuh/kaiaulu/blob/716a46e546d5293f3b2856cace62311c95aae977/R/parser.R#L254-L257

That may give us further insight.

lh-zhan commented 1 year ago

Hey @carlosparadis

Appreciate the instruction, I think we are one step closer!

I tried using the full path from root for the mbox, and ran the parse_mbox function line by line. Interestingly, this line https://github.com/sailuh/kaiaulu/blob/716a46e546d5293f3b2856cace62311c95aae977/R/parser.R#L254 didn't throw any error. I printed everything out in the console, and the output is identical to what I get when I run perceval in my terminal.

But the next line https://github.com/sailuh/kaiaulu/blob/716a46e546d5293f3b2856cace62311c95aae977/R/parser.R#L259

Gave me

Error: parse error: after array element, I expect ',' or ']'
                                 [2023-04-08 22:30:58,822] - Sir Perce
                     (right here) ------^

I think we've tracked down the root of the issue here, seems like a parsing format error here but sadly I'm unfamiliar with the data.table function used here.
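The text the parser stops at looks like a Perceval log line rather than a JSON record. When system2() is called with both stdout = TRUE and stderr = TRUE, R returns both streams in a single character vector, so log messages can end up mixed in with the records. Below is a minimal sketch of one way to separate them, assuming (not confirmed in this thread) that each JSON record starts with `{` on its own line; the sample lines are shortened, hypothetical stand-ins for real output.

```r
# Shortened, hypothetical stand-in for captured Perceval output
perceval_output <- c(
  "[2023-04-08 22:30:58,822] - Sir Perceval is on his quest.",
  "{\"backend_name\":\"MBox\",\"category\":\"message\"}",
  "[2023-04-08 22:30:58,824] - Sir Perceval completed his quest."
)

# Keep only lines that look like JSON objects; the rest are log messages
is_json_line <- startsWith(trimws(perceval_output), "{")
json_lines <- perceval_output[is_json_line]
log_lines <- perceval_output[!is_json_line]

print(json_lines)
print(log_lines)
```

Only `json_lines` would then be handed to the JSON parser, while `log_lines` stays available for diagnosing Perceval errors.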

carlosparadis commented 1 year ago

Would you be able to send me the output generated by:

jsonlite::stream_in(textConnection(perceval_output),verbose=FALSE)

e.g.

perceval_output <- jsonlite::stream_in(textConnection(perceval_output),verbose=FALSE)
jsonlite::write_json(some_filepath,perceval_output)

Maybe via hyperlink to a google drive or dropbox? I want to try and replicate it on my end.

From the error, it seems the problem is that data.table doesn't know what [2023-04-08 22:30:58,822] means as a date.

This could be addressed by:

perceval_output <- jsonlite::stream_in(textConnection(perceval_output),verbose=FALSE)
perceval_parsed <- ....... # More careful field parse happens here 

As for data.table, it is a library implemented in C/C++ that is more optimized than base R's data.frame. I am curious about why you are encountering this error, however, since you are using the same version of Perceval that 3 others are using, and they have not encountered this error. Hence the data request, so I can compare the files.

lh-zhan commented 1 year ago

Sorry about the late reply, I didn't get a chance to check my emails today until now(EST time :( )

I got an Error in file(con, "w") : invalid 'description' argument but I found another way to store the json object :)
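This error is consistent with the arguments to jsonlite::write_json() being given in the wrong order: the function takes the object first and the file path second. A minimal sketch, with a hypothetical records table and a temporary file standing in for the real output and destination:

```r
library(jsonlite)

# Hypothetical records standing in for parsed Perceval output
records <- data.frame(uuid = c("a1", "b2"), stringsAsFactors = FALSE)

# Hypothetical output location
out_path <- tempfile(fileext = ".json")

# The object comes first, the destination path second
write_json(records, out_path)

# Read the file back to confirm it round-trips
restored <- read_json(out_path, simplifyVector = TRUE)
print(restored)
```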

I've placed the file at https://drive.google.com/file/d/1_tCDsakklY8piqwPgBG23Fb1FGjINe2P/view?usp=sharing

It does seem like the JSON object has syntax error as when I attempt to open it in Firefox it gives: SyntaxError: JSON.parse: expected ',' or ']' after array element at line 1 column 6 of the JSON data

You can view it as Raw Data to view the entire message.

Thanks!

carlosparadis commented 1 year ago

@lh-zhan

No worries! Is this the data you have been getting all along? Because if so, the problem is not with Kaiaulu, but just the fact the Perceval command is not executing correctly. Perceval can sometimes execute and create a file, even if it can't find the data. However, if you look at the data file generated you will see it failed:

[2023-04-09 23:22:05,955] - Sir Perceval is on his quest.
[2023-04-09 23:22:05,955] - Looking for messages from '/Users/lzhan/Desktop/rawdata/mbox/dev_helix_apache_org' on '/Users/lzhan/Desktop/rawdata/mbox/dev_helix_apache_org.mbox' since 1970-01-01 00:00:00+00:00
[2023-04-09 23:22:05,957] - Done. 1/1 messages fetched; 0 ignored
[2023-04-09 23:22:05,957] - Fetch process completed
[2023-04-09 23:22:05,957] - Summary of results

       Total items:     1
    Items produced:     1
     Items skipped:     0

    Last item UUID:     0a91037c482d476b4fdfd5f35d3e9c9534eebcba
    Last item date:     2023-03-29 12:42:18+00:00

    Min. item date:     2023-03-29 12:42:18+00:00
    Max. item date:     2023-03-29 12:42:18+00:00

    Min. offset:    -   Max. offset:    -   Last offset:    -

[2023-04-09 23:22:05,957] - Sir Perceval completed his quest.
{"backend_name":"MBox","backend_version":"0.12.0","category":"message","classified_fields_filtered":null,"data":{"Authentication-Results":"apache.org; auth=none","Content-Transfer-Encoding":"quoted-printable","Content-Type":"text/plain; charset=\"UTF-8\"","Date":"Wed, 29 Mar 2023 08:42:18 -0400","Delivered-To":"apmail-dev-all@apache.org","From":"Rich Bowen <rbowen@apache.org>","List-Help":"<mailto:dev-help@helix.apache.org>","List-Id":"<dev.helix.apache.org>","List-Post":"<mailto:dev@helix.apache.org>","List-Unsubscribe":"<mailto:dev-unsubscribe@helix.apache.org>","MIME-Version":"1.0","Mailing-List":"contact dev-help@helix.apache.org; run by ezmlm","Message-ID":"<6335f4359cbfdb7bdd42ed1c434a9077e59af2ed.camel@apache.org>","Organization":"The Apache Software Foundation","Precedence":"bulk","Received":"from [192.168.21.205] (unknown [52.95.4.13])\n\tby mailrelay1-he-de.apache.org (ASF Mail Server at mailrelay1-he-de.apache.org) with ESMTPSA id 8898A3EE47;\n\tWed, 29 Mar 2023 12:42:19 +0000 (UTC)","Reply-To":"dev@helix.apache.org","Return-Path":"<dev-return-7519-archive-asf-public=cust-asf.ponee.io@helix.apache.org>","Subject":"A Message from the Board to PMC members","To":"\"board@apache.org\" <board@apache.org>","User-Agent":"Evolution 3.46.4 (3.46.4-1.fc37) ","X-Original-To":"archive-asf-public@cust-asf.ponee.io","body":{"plain":"Dear Apache Project Management Committee (PMC) members,\n\nThe Board wants to take just a moment of your time to communicate a few\nthings that seem to have been forgotten by a number of PMC members,\nacross the Foundation, over the past few years.  Please note that this\nis being sent to all projects - yours has not been singled out.\n\nThe Project Management Committee (PMC) as a whole[1] is tasked with the\noversight, health, and sustainability of the project. 
The PMC members\nare responsible collectively, and individually, for ensuring that the\nproject operates in a way that is in line with ASF philosophy, and in a\nway that serves the developers and users of the project.\n\nThe PMC Chair is not the project leader, in any sense. It is the person\nwho files board reports and makes sure they are delivered on time. It\nis the secretary for the project, and the project\u2019s  ambassador to the\nBoard of Directors. The VP title is given as an artifact of US\ncorporate law, and not because the PMC Chair has any special powers. If\nyou are treating your PMC Chair as the project lead, or granting them\nany other special powers or privileges, you need to be aware that\nthat\u2019s not the intent of the Chair role. The Chair is a PMC member peer\nwith a few extra duties.\n\nEvery PMC member has an equal voice in deliberations. Each has one\nvote. Each has veto power. Every vote weighs the same. It is not only\nyour right, but it is your obligation, to use that vote for the good of\nthe project and its users, not to appease the Chair, your employer, or\nany other voice in the project. \n\nEvery PMC member can, and should, nominate new committers, and new PMC\nmembers. This is not the sole domain of the PMC Chair. This might be\nyour most important responsibility to the project, as succession\nplanning is the path to sustainability.\n\nEvery PMC member can, and should, respond when the Board sends email to\nyour private list. You should not wait for the PMC Chair to respond.\nThe Board views the entire PMC as responsible for the project, not just\none member.\n\nEvery PMC member should be subscribed to the private@ mailing list. If\nyou are not, then you are neglecting your duty of oversight. If you no\nlonger wish to be responsible for oversight of the project, you should\nresign your PMC seat, not merely drop off of the private@ list and\nignore it. 
You can determine which PMC members are not subscribed to\nyour private list by looking at your PMC roster at\nhttps://whimsy.apache.org/roster/committee/  Names with an asterisk (*)\nnext to them are not subscribed to the list. We encourage you to take a\nmoment to contact them with this information.\n\nThank you for your attention to these matters, and thank you for\nkeeping our projects healthy.\n\nRich, for The Board of Directors\n\n[1] https://apache.org/foundation/how-it-works.html#pmc-members\n\n"},"unixfrom":"dev-return-7519-archive-asf-public=cust-asf.ponee.io@helix.apache.org  Wed Mar 29 12:48:18 2023"},"origin":"/Users/lzhan/Desktop/rawdata/mbox/dev_helix_apache_org","perceval_version":"0.12.24","search_fields":{"item_id":"<6335f4359cbfdb7bdd42ed1c434a9077e59af2ed.camel@apache.org>"},"tag":"/Users/lzhan/Desktop/rawdata/mbox/dev_helix_apache_org","timestamp":1681096925.95702,"updated_on":1680093738.0,"uuid":"0a91037c482d476b4fdfd5f35d3e9c9534eebcba"}

Was this also what you get when executing Perceval via the terminal? If so, you should try testing it again via Terminal that the parameters passed work. Otherwise, if that was not the intended file, let me know.

I should have asked this in advance: What OS are you using? OS X?

Also, I do not believe the above file is a JSON file. So, it makes sense any json parser would fail too.

lh-zhan commented 1 year ago

Hey! I'm on macOS Monterey.

Interestingly, when I run the command from terminal, it returns a valid and parseable json object as the following:

lzhan@mac-groundapple ~ % /Users/lzhan/Library/Python/3.11/bin/perceval mbox dev_helix_apache_org /Users/lzhan/Desktop/rawdata/mbox/dev_helix_apache_org.mbox
[2023-04-10 10:54:36,909] - Sir Perceval is on his quest.
[2023-04-10 10:54:36,909] - Looking for messages from 'dev_helix_apache_org' on '/Users/lzhan/Desktop/rawdata/mbox/dev_helix_apache_org.mbox' since 1970-01-01 00:00:00+00:00
{
    "backend_name": "MBox",
    "backend_version": "0.12.0",
    "category": "message",
    "classified_fields_filtered": null,
    "data": {
        "Authentication-Results": "apache.org; auth=none",
        "Content-Transfer-Encoding": "quoted-printable",
        "Content-Type": "text/plain; charset=\"UTF-8\"",
        "Date": "Wed, 29 Mar 2023 08:42:18 -0400",
        "Delivered-To": "apmail-dev-all@apache.org",
        "From": "Rich Bowen <rbowen@apache.org>",
        "List-Help": "<mailto:dev-help@helix.apache.org>",
        "List-Id": "<dev.helix.apache.org>",
        "List-Post": "<mailto:dev@helix.apache.org>",
        "List-Unsubscribe": "<mailto:dev-unsubscribe@helix.apache.org>",
        "MIME-Version": "1.0",
        "Mailing-List": "contact dev-help@helix.apache.org; run by ezmlm",
        "Message-ID": "<6335f4359cbfdb7bdd42ed1c434a9077e59af2ed.camel@apache.org>",
        "Organization": "The Apache Software Foundation",
        "Precedence": "bulk",
        "Received": "from [192.168.21.205] (unknown [52.95.4.13])\n\tby mailrelay1-he-de.apache.org (ASF Mail Server at mailrelay1-he-de.apache.org) with ESMTPSA id 8898A3EE47;\n\tWed, 29 Mar 2023 12:42:19 +0000 (UTC)",
        "Reply-To": "dev@helix.apache.org",
        "Return-Path": "<dev-return-7519-archive-asf-public=cust-asf.ponee.io@helix.apache.org>",
        "Subject": "A Message from the Board to PMC members",
        "To": "\"board@apache.org\" <board@apache.org>",
        "User-Agent": "Evolution 3.46.4 (3.46.4-1.fc37) ",
        "X-Original-To": "archive-asf-public@cust-asf.ponee.io",
        "body": {
            "plain": "Dear Apache Project Management Committee (PMC) members,\n\nThe Board wants to take just a moment of your time to communicate a few\nthings that seem to have been forgotten by a number of PMC members,\nacross the Foundation, over the past few years.  Please note that this\nis being sent to all projects - yours has not been singled out.\n\nThe Project Management Committee (PMC) as a whole[1] is tasked with the\noversight, health, and sustainability of the project. The PMC members\nare responsible collectively, and individually, for ensuring that the\nproject operates in a way that is in line with ASF philosophy, and in a\nway that serves the developers and users of the project.\n\nThe PMC Chair is not the project leader, in any sense. It is the person\nwho files board reports and makes sure they are delivered on time. It\nis the secretary for the project, and the project\u2019s  ambassador to the\nBoard of Directors. The VP title is given as an artifact of US\ncorporate law, and not because the PMC Chair has any special powers. If\nyou are treating your PMC Chair as the project lead, or granting them\nany other special powers or privileges, you need to be aware that\nthat\u2019s not the intent of the Chair role. The Chair is a PMC member peer\nwith a few extra duties.\n\nEvery PMC member has an equal voice in deliberations. Each has one\nvote. Each has veto power. Every vote weighs the same. It is not only\nyour right, but it is your obligation, to use that vote for the good of\nthe project and its users, not to appease the Chair, your employer, or\nany other voice in the project. \n\nEvery PMC member can, and should, nominate new committers, and new PMC\nmembers. This is not the sole domain of the PMC Chair. This might be\nyour most important responsibility to the project, as succession\nplanning is the path to sustainability.\n\nEvery PMC member can, and should, respond when the Board sends email to\nyour private list. 
You should not wait for the PMC Chair to respond.\nThe Board views the entire PMC as responsible for the project, not just\none member.\n\nEvery PMC member should be subscribed to the private@ mailing list. If\nyou are not, then you are neglecting your duty of oversight. If you no\nlonger wish to be responsible for oversight of the project, you should\nresign your PMC seat, not merely drop off of the private@ list and\nignore it. You can determine which PMC members are not subscribed to\nyour private list by looking at your PMC roster at\nhttps://whimsy.apache.org/roster/committee/  Names with an asterisk (*)\nnext to them are not subscribed to the list. We encourage you to take a\nmoment to contact them with this information.\n\nThank you for your attention to these matters, and thank you for\nkeeping our projects healthy.\n\nRich, for The Board of Directors\n\n[1] https://apache.org/foundation/how-it-works.html#pmc-members\n\n"
        },
        "unixfrom": "dev-return-7519-archive-asf-public=cust-asf.ponee.io@helix.apache.org  Wed Mar 29 12:48:18 2023"
    },
    "origin": "dev_helix_apache_org",
    "perceval_version": "0.12.24",
    "search_fields": {
        "item_id": "<6335f4359cbfdb7bdd42ed1c434a9077e59af2ed.camel@apache.org>"
    },
    "tag": "dev_helix_apache_org",
    "timestamp": 1681138476.910926,
    "updated_on": 1680093738.0,
    "uuid": "b8de6197adf2212d6ae26238d2c2dd69f081e2b0"
}
[2023-04-10 10:54:36,911] - Done. 1/1 messages fetched; 0 ignored
[2023-04-10 10:54:36,911] - Fetch process completed
[2023-04-10 10:54:36,911] - Summary of results

       Total items:     1
    Items produced:     1
     Items skipped:     0

    Last item UUID:     b8de6197adf2212d6ae26238d2c2dd69f081e2b0
    Last item date:     2023-03-29 12:42:18+00:00

    Min. item date:     2023-03-29 12:42:18+00:00
    Max. item date:     2023-03-29 12:42:18+00:00

    Min. offset:    -   Max. offset:    -   Last offset:    -

[2023-04-10 10:54:36,911] - Sir Perceval completed his quest.

Taking another look at the JSON file that was saved, all the outputs were stored upside down. By that I meant,

[2023-04-10 10:44:30,236] - Sir Perceval is on his quest.
[2023-04-10 10:44:30,237] - Looking for messages from 'dev_helix_apache_org' on '/Users/lzhan/Desktop/rawdata/mbox/dev_helix_apache_org.mbox' since 1970-01-01 00:00:00+00:00
[2023-04-10 10:44:30,239] - Done. 1/1 messages fetched; 0 ignored
[2023-04-10 10:44:30,239] - Fetch process completed
[2023-04-10 10:44:30,239] - Summary of results

       Total items:     1
    Items produced:     1
     Items skipped:     0

    Last item UUID:     b8de6197adf2212d6ae26238d2c2dd69f081e2b0
    Last item date:     2023-03-29 12:42:18+00:00

    Min. item date:     2023-03-29 12:42:18+00:00
    Max. item date:     2023-03-29 12:42:18+00:00

    Min. offset:    -   Max. offset:    -   Last offset:    -

[2023-04-10 10:44:30,239] - Sir Perceval completed his quest.

This summary got stored at the beginning of the file, whereas when we execute the same command in the terminal, the summary is displayed at the end.

And yes, you are correct that the file is not a parseable JSON object. I think what happened was that all the data was condensed into one large paragraph, which led to unparseable JSON.

carlosparadis commented 1 year ago

I just tested parse_mbox() on my end, and I had the following parameters:

perceval_path = "/Users/cvp/perceval/bin/perceval"
mbox_path = "../../rawdata/mbox/thrift-dev.mbox"
mbox_uri = "../../rawdata/mbox/thrift-dev"

This is what the start of the file should look like:

["{\"backend_name\":\"MBox\",\"backend_version\":\"0.12.0\",\"category\":\"message\",\"classified_fields_filtered\":null,\"data\":{\"Content-Disposition\":\"inline\",\"Content-Type

The file is about 316.1MB, and it should be usable by jsonlite::write_json() too from perceval_output.

Note that, when you are using RStudio knitr to compile the notebook, and when you are using the Terminal in RStudio your current path differs.

The project configuration file assumes your current path is in kaiaulu/vignettes/notebook.Rmd, and hence ../../path/to/mbox. However, if you are trying to run commands in the RStudio Terminal, the RStudio default path will likely be kaiaulu. In that case, your mbox_path should be adjusted to just ../path/to/mbox.

Failing to do so will result in no files being parsed due to the file path being incorrect (I believe this was the original error).

The second error you are experiencing is not clear to me, since I can't reproduce it. Maybe check, when you are testing manually, that your parameters mbox_uri and mbox_path are consistent with what you run in an OS X terminal. The system2 call should not be doing anything more than executing said command on the terminal for you, after all.

I'd suggest you also empty your environment and only try to use the system2 call with the parameters matching what you have on the terminal. R can use variables defined outside function scope if it doesn't find them first in the function. This can sometimes lead to a lot of confusion when diagnosing what went wrong.
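The scoping behavior mentioned above can be seen in a small example. This is a minimal sketch with hypothetical function names and paths, not code from Kaiaulu:

```r
# A leftover global from an earlier experiment (hypothetical value)
mbox_path <- "stale/path/helix.mbox"

# No parameter and no local variable: R falls back to the enclosing environment
show_path_implicit <- function() {
  mbox_path
}

# Passing the value explicitly removes the dependence on the environment
show_path_explicit <- function(mbox_path) {
  mbox_path
}

implicit_result <- show_path_implicit()
explicit_result <- show_path_explicit("../rawdata/mbox/thrift-dev.mbox")

print(implicit_result)
print(explicit_result)
```

When code is run line by line outside the function, a stale variable like this one is picked up silently, which is why clearing the environment first makes the test more trustworthy.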

lh-zhan commented 1 year ago

Hi Carlos,

Funny discovery, I spent this morning trying to debug but the error persists with helix mbox. However, I was able to run the thrift mbox that you used above with the same filepath setup without any error. The outcome looks like: Screen Shot 2023-04-11 at 9 38 42 AM

Using thrift is actually my main goal of running social smell in the first place. I wanted to test out the entire workflow using the default helix.yml file first before switching to thrift.yml. I guess I can continue my journey of using thrift.yml file to utilize social smell notebook for now.

Again, thank you for all the help and suggestions! (I'll most likely run into more issues and need your help again :) )

carlosparadis commented 1 year ago

Sounds like a plan! I will check later on helix.mbox to see what is not working. I take it you used the .mbox file from Codeface, right?

lh-zhan commented 1 year ago

I didn't, I went to https://lists.apache.org/list.html?dev@thrift.apache.org to download mbox from different months to test things out. Now that you mentioned it, I see there is a download link in thrift.yml for thrift mbox, but couldn't find one in helix.yml. Would be very beneficial to specify which helix mbox to download for testing purpose :)

carlosparadis commented 1 year ago

@lh-zhan Honestly, you are better off downloading from where you pointed, as that would be the only way to ensure the dataset is current. The other source I had was from a supplemental material. It is good for testing things out, but it won't take you too far for actual analysis.

I believe I identified the issue and pushed some code changes on #185. Would you mind verifying if that fixes the problem for your original dataset and follow up on that issue, so I could close?

I expect you will encounter the same problem in other mailing lists, as it seems the .mbox fields do vary between projects (contrary to .git that have consistent fields). It works for helix, likely because it contained the same fields as OpenSSL. The fix should hopefully make the function future proof for any project.
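One way to tolerate fields that vary between archives is the skip_absent option that the data.table error message itself suggests. The sketch below is only an illustration of that option and does not necessarily reflect what the changes in #185 do; the table contents and the target column names are hypothetical.

```r
library(data.table)

# Hypothetical archive that lacks the Cc and In-Reply-To fields
perceval_parsed <- data.table(
  data.Date = "Wed, 29 Mar 2023 08:42:18 -0400",
  data.To = "board@apache.org"
)

old_names <- c("data.Date", "data.To", "data.Cc", "data.In.Reply.To")
# Illustrative target names, not necessarily the ones Kaiaulu uses
new_names <- c("reply_date", "reply_to", "reply_cc", "in_reply_to")

# skip_absent = TRUE renames the columns that exist and ignores the rest
setnames(perceval_parsed, old_names, new_names, skip_absent = TRUE)
print(names(perceval_parsed))
```

Code further down the pipeline would still need to handle the columns that were never present.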

carlosparadis commented 1 year ago

Since using the older version of Perceval works for now, I am closing this issue.