sailuh / kaiaulu

An R package for mining software repositories
http://itm0.shidler.hawaii.edu/kaiaulu
Mozilla Public License 2.0

Data Schema #226

Open carlosparadis opened 1 year ago

carlosparadis commented 1 year ago

Kaiaulu's architecture was partially planned and partially built as features were added. Without familiarity with the domain, it is not always easy to see where the tables connect.

The goal of this issue is to make some of these table relationships more explicit and give you a better understanding of how the issue tracker data connects to other tables the tool can mine.

Our goal is to examine a few notebooks in itm0 (note you do not need to compile them, just read straight from itm0) and identify how the tables can be connected. Contrary to #225, there is a single deliverable here: the MySQL Workbench file with an entity relationship diagram of the data, so you are expected to work together on this issue. The tables in the Notebooks should be represented as tables and columns in MySQL Workbench, with related columns connected. If you see potential for columns to be connected, even if the information is not an exact match, please note them too.

The Notebooks you should divide between you are (see the table at the end of each notebook):

Please let me know in this issue how you will split the Notebooks between you two for the next week. For Notebook questions, post on Discussions. For questions concerning the task, post here. Please pull request the .mwb to the private repo by Friday 09/15.

Rubegen commented 1 year ago

I will take care of bug_count.html and line_metrics_showcase.html and Waylon will handle the other two.

carlosparadis commented 1 year ago

@Rubegen @waylonho I've modified the private project so you can both commit the .mwb to the private repo. Simply follow the instructions it has on how to commit, and send the file there. In addition, please post the .png of what you currently have as a comment here (preferably by tomorrow so I have time to look at it again).

For this week, let's just consider this Notebook since the unit tests will keep you both busy too:

The table of interest is towards the end. Try to also understand from the Notebook what the data is trying to tell us, and we can iterate on call. In addition, try to understand the "data granularity". Up to this point we were looking at a change of a file (commit) as the granularity of the data. You will notice this one contains commit intervals.

Rubegen commented 1 year ago

Here is the .png of the data schema tables

DataSchemaTables

carlosparadis commented 1 year ago

Adding for the record, since it was only discussed on call: Task for Week 4 - Sep 22 was to add the Git Log to this database schema. @waylonho did you find the schema with the table that was due for this week to post here?

waylonho commented 12 months ago

Here is the updated schema. Two questions:

Not entirely sure where the Social Smells Showcase connects to. Is it linked to one of the tables in the diagram or another notebook?

The Git Log table has the filepath.c as FK, is that right?

Schema

carlosparadis commented 12 months ago

Hi Waylon,

I am not sure the table for Exploring Git Log makes much sense: It is just a list of filepaths? Could you explain it to me?

The social smells table connects via commit interval, but you have to design the tables around the fact that it is a commit interval rather than a single commit. In theory they would connect to your Explore Git Log, since that's where your commit hashes are... but I am not sure what is going on in your current table.
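One way to picture the commit-interval design is the minimal sketch below, using an in-memory SQLite database. All table and column names (`git_commit`, `commit_interval`, `social_smell`, `start_commit_hash`, `end_commit_hash`, `smell_value`) are hypothetical placeholders, not Kaiaulu's actual schema, and ordering commits by timestamp to decide which ones fall inside an interval is an assumption made for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite does not enforce FKs by default

# Hypothetical schema: an interval is bounded by two commits, and the
# smell metrics hang off the interval rather than off a single commit.
conn.executescript("""
CREATE TABLE git_commit (
    commit_hash     TEXT PRIMARY KEY,
    author_datetime TEXT NOT NULL
);
CREATE TABLE commit_interval (
    interval_id       INTEGER PRIMARY KEY,
    start_commit_hash TEXT NOT NULL REFERENCES git_commit(commit_hash),
    end_commit_hash   TEXT NOT NULL REFERENCES git_commit(commit_hash)
);
CREATE TABLE social_smell (
    interval_id INTEGER PRIMARY KEY REFERENCES commit_interval(interval_id),
    smell_value INTEGER  -- placeholder for the real metric columns
);
""")

conn.executemany(
    "INSERT INTO git_commit VALUES (?, ?)",
    [
        ("a1", "2023-09-01T10:00:00"),
        ("b2", "2023-09-02T10:00:00"),
        ("c3", "2023-09-03T10:00:00"),
        ("d4", "2023-09-04T10:00:00"),
    ],
)
conn.execute("INSERT INTO commit_interval VALUES (1, 'a1', 'c3')")
conn.execute("INSERT INTO social_smell VALUES (1, 4)")


def commits_in_interval(conn, interval_id):
    """Return the hashes of commits dated between the interval's two bounds."""
    rows = conn.execute(
        """
        SELECT c.commit_hash
        FROM commit_interval AS i
        JOIN git_commit AS s ON s.commit_hash = i.start_commit_hash
        JOIN git_commit AS e ON e.commit_hash = i.end_commit_hash
        JOIN git_commit AS c
          ON c.author_datetime BETWEEN s.author_datetime AND e.author_datetime
        WHERE i.interval_id = ?
        ORDER BY c.author_datetime
        """,
        (interval_id,),
    ).fetchall()
    return [row[0] for row in rows]


print(commits_in_interval(conn, 1))  # ['a1', 'b2', 'c3']
```

In this sketch the smell row never references a commit directly; it reaches commits only through the interval's two bounding hashes, which is the difference in granularity described above.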

Can you paste a screenshot of what table you are using? It should have been a project_git table.

waylonho commented 12 months ago

I used the table from http://itm0.shidler.hawaii.edu/kaiaulu/articles/gitlog_showcase.html#visualizing-the-git-log. be975c9a5437f32b6adbba3abf9051ee

Looking over it, it probably makes sense it was supposed to be the table with commit_hash from http://itm0.shidler.hawaii.edu/kaiaulu/articles/gitlog_entity_showcase.html. Will fix shortly.

carlosparadis commented 12 months ago

Hi,

I see now. There is a misunderstanding: When we last discussed the Project Git Log Notebook, I mentioned the table was not available on itm0. I sent it on Zoom, and you confirmed downloading it.

The table you should be using is this: https://drive.google.com/drive/u/2/folders/1XdSZ4YEZFYRTKz8EGAf2UnysKpyoLrWW

Git Log Entity Table is something else we have not discussed yet.

waylonho commented 12 months ago

I've updated the project_git table.

Schema

carlosparadis commented 12 months ago

Hi, this is good, thank you! Since we are closing in on your milestone report, and I want to make sure you can devote time to finishing the git sample fake data, I expanded the diagram for you both. A few caveats on what may be incomplete or missing in what I put together:

  1. We need to make clearer which Notebooks the tables are coming from
  2. Some table names may need to be improved
  3. Columns are still missing or may have names inconsistent with Kaiaulu's actual tables
  4. We still need to verify that the cardinalities of the relationships make sense and are correct (i.e. the chicken feet)
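For the cardinality check in item 4, a rough sanity test against sample data can help before drawing the notation. The helper below is a hypothetical sketch, not part of Kaiaulu: it takes observed `(left_key, right_key)` pairs and reports which cardinality the sample implies. A sample can only show that a "many" side exists; it cannot prove a relationship is strictly one-to-one.

```python
from collections import defaultdict


def relationship_cardinality(pairs):
    """Classify observed (left_key, right_key) pairs as 1:1, 1:N, N:1 or N:M."""
    rights_per_left = defaultdict(set)
    lefts_per_right = defaultdict(set)
    for left, right in pairs:
        rights_per_left[left].add(right)
        lefts_per_right[right].add(left)

    # A left key seen with several right keys means "many" on the right side,
    # and vice versa.
    left_has_many = any(len(v) > 1 for v in rights_per_left.values())
    right_has_many = any(len(v) > 1 for v in lefts_per_right.values())

    if left_has_many and right_has_many:
        return "N:M"
    if left_has_many:
        return "1:N"
    if right_has_many:
        return "N:1"
    return "1:1"


# Made-up sample: commit hash paired with a file path it changed.
sample = [("a1", "src/x.c"), ("a1", "src/y.c"), ("b2", "src/x.c")]
print(relationship_cardinality(sample))  # N:M
```

An `N:M` result like this one suggests the diagram needs an association table between the two entities rather than a direct foreign key.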

In essence, I am taking away from you the task of reverse engineering the relevant tables from the Notebooks, and reducing your effort in seeing how they connect with this .mwb file. Your goal is now to improve this rather than create it from scratch. There are also some tables that are stubs and need thinking on how to connect. You can think of these as the last steps. If you can work those out on your own and with minimal Q&A towards the end, I'd say you both got the deliverable and the mindset to do this in the future.

I also want you to check how I used the layers from the left pane, and how I established the foreign keys. On the .mwb, double click any table, and at the bottom there is a list of tabs. Pick a table that has a connection, and select the foreign key tab. One of the two connected tables will have information there. Use that pane instead of drag and drop so it doesn't generate additional columns. If the left pane and bottom tabs don't make sense to you, please ask me on call.

For the tables where you have no idea where to begin, you should know the drill by now: ask questions here. As you can see, Kaiaulu's scope of analysis is fairly large, so it is vital you continue to ask for information on where things are if they are not clear. That, in itself, much like your experience report, is an indication that documentation is lacking in the Notebooks too. That being said, it is important you at least try to locate or guess where the information is too.

The relationships on the expanded table, I hope, should facilitate you seeing the forest for the trees for what we last talked about: "Source Code", "Git Log", "Issue Tracker". I also included table stubs for the ones you made, and layers so you can more easily see how some tables relate.

Please try to understand what the data is capturing too, or you will be unable to explain what you are working on in your presentations. I promise there is logic to the madness.

Moving forward, let's try to pick one layer at a time, and refine them as we go, comparing to the Notebooks.

You can find the editable .mwb here, where some sample files that are not visible on itm0 are also stored: https://drive.google.com/drive/u/2/folders/1XdSZ4YEZFYRTKz8EGAf2UnysKpyoLrWW

kaiaulu_v1

Let me know if you have questions.

waylonho commented 11 months ago

Hi. I've connected some of the table stubs that weren't connected before. I see how everything is set up and connected more thoroughly now, but I still have some points of confusion:

I added the project_git table and connected it to Social Smells, is this correct?

Also, just want to make sure, there are some tables in the previous ERD that aren't on the bigger one. Are we supposed to add every table from our old diagram that isn't on this one to this one? Just want to clarify because I am not sure. Will edit it more as we go on.

kaiaulupng

carlosparadis commented 11 months ago

You should not add the project_git table. The commit table is what you are calling project_git. Just add the remaining columns there.

Don't worry about the social smells table for now. Try to fill in the information on Source Code, Commit and Issue Tracker. Have a look on Google Drive too, I should have added a few more tables there. To save time, feel free to check the URL of each table with me here before filling in the information.

waylonho commented 11 months ago

Hello,

I have added information from the tables on Google Drive to the diagram. Just one more question, is the table at http://itm0.shidler.hawaii.edu/kaiaulu/articles/depends_showcase.html the "files" table in the Source Code part?

kaiaulu

carlosparadis commented 11 months ago

@waylonho No, the table in the notebook is the "dependencies" table. Depends, the tool the Notebook depends on, outputs a graph of dependencies between files. A graph is represented by a "nodes" table, and an "edgelist" table. The table you see on the Notebook is just the edgelist table.
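As a rough illustration of the nodes/edgelist split, here is a minimal sketch in plain Python. The column names (`src`, `dest`, `weight`) and the file names are hypothetical, not the actual output of Depends or of the Notebook. The point is only that each endpoint in the edgelist should match a row in the nodes ("files") table, which is what the foreign keys in the diagram express.

```python
# Hypothetical edgelist: one row per dependency between two files.
edgelist = [
    {"src": "a.c", "dest": "b.c", "weight": 2},
    {"src": "a.c", "dest": "c.c", "weight": 1},
    {"src": "b.c", "dest": "c.c", "weight": 5},
]


def nodes_from_edgelist(edges):
    """Derive the nodes table (unique file paths) implied by an edgelist."""
    return sorted({e["src"] for e in edges} | {e["dest"] for e in edges})


def dangling_endpoints(edges, files):
    """List edgelist endpoints missing from the files table (broken FKs)."""
    return [path for path in nodes_from_edgelist(edges) if path not in files]


print(nodes_from_edgelist(edgelist))                 # ['a.c', 'b.c', 'c.c']
print(dangling_endpoints(edgelist, {"a.c", "b.c"}))  # ['c.c']
```

If `dangling_endpoints` returns anything for the real tables, the "dependencies" table references a file that the "files" table does not list, and the foreign key in the diagram would not hold as drawn.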

So it is safe to say the "files" table is complete, just fill the "dependencies" table :)

waylonho commented 11 months ago

Not sure if it's supposed to be, but the project_dependencies file in the src folder on drive is a json.