netcreateorg / netcreate-2018

Please report bugs, problems, ideas in the project Issues page: https://github.com/netcreateorg/netcreate-2018/issues

Import Data #176

Open benloh opened 2 years ago

benloh commented 2 years ago

For the Feb 2022 pilots/tests, we want to prioritize:

  1. Importing nodes/edges to an existing database

The node/edge data might be created from scratch or created by first exporting existing nodes/edges. (e.g. the Main Use Model and Secondary Use Model described below)

  2. Importing nodes/edges to a NEW database

This may be addressed in the future. It will not be implemented at this moment as there is a workaround via manual template editing and nc-multiplex.

To Do

benloh commented 2 years ago

@jdanish @kalanicraig I have some questions about how to handle imports.

Currently NetCreate is designed to work with a single starting database. When you start up the app, you have to specify a specific database file (e.g. ./nc.js --dataset=tacitus).

So when you import new data, do you intend to:

a) Add new records to existing database? And if there is an existing record, do you want to overwrite it?

or

b) Replace all existing records in the current database with the imported records?

or

c) Create a new database with the imported records, giving the database a new name (or really starting with a new empty database).

Each one of the three would require a slightly different use model and workflow. Or do you need to support all three different use models?
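To make the difference between the three options concrete, here is a minimal sketch. It assumes records are plain objects keyed by `id`; the function names (`importAppend`, `importReplaceAll`, `importAsNewDatabase`) are hypothetical and do not come from the NetCreate codebase.

```javascript
// Hypothetical sketch: none of these names come from the NetCreate codebase.
// Records are assumed to be plain objects with a unique `id`.

// (a) Add imported records to the existing ones. `overwrite` decides what
// happens when an imported id already exists.
function importAppend(existing, imported, overwrite = false) {
  const byId = new Map(existing.map(rec => [rec.id, rec]));
  for (const rec of imported) {
    if (!byId.has(rec.id) || overwrite) byId.set(rec.id, rec);
  }
  return [...byId.values()];
}

// (b) Discard the existing records and keep only the imported ones.
function importReplaceAll(existing, imported) {
  return [...imported];
}

// (c) Build a new, separately named database from the imported records.
function importAsNewDatabase(name, nodes, edges) {
  return { name, nodes: [...nodes], edges: [...edges] };
}

const existing = [{ id: 1, label: "Tacitus" }, { id: 2, label: "Agricola" }];
const imported = [{ id: 2, label: "Gnaeus Julius Agricola" }, { id: 3, label: "Domitian" }];

// (a) append only: id 2 keeps its existing label
console.log(importAppend(existing, imported).map(n => n.label));
// (a) with overwrite: id 2 takes the imported label
console.log(importAppend(existing, imported, true).map(n => n.label));
// (b) replace all: only the imported records remain
console.log(importReplaceAll(existing, imported).map(n => n.label));
```

Option (c) is the only one that needs a database name, which is why it touches app startup rather than just the record set.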

kalanicraig commented 2 years ago

I can see a use for a modified A (append only, no edits to existing rows) and C (for new or heavily modified datasets). B would be nice but could be accommodated by A and C, I think. I’d start with C since it lets us use the export feature to get a dataset out, mod it externally, and reimport into the Net.Create environment if necessary.

Responding to export requests next!

—k

kalanicraig commented 2 years ago

To clarify: The most frequent things we see for import needs are new batch node additions with some new edges, or a big batch edge import with few new nodes. C first, and then templates, and then A-append-only would get us more coverage of existing needs.

If we get to A (I’d prioritize template creation and editing after import option C), it would assume no node disambiguation on import, and edge lookup based on first-node-label-match.

I’d rather folks handle node disambiguation using the existing edit/delete/merge features for now than pile time into an import feature that is infrequently used.
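A rough sketch of what "edge lookup based on first-node-label-match" could look like. The column names (`sourceLabel`, `targetLabel`, `relationship`) and the function name are assumptions for illustration, not the actual import format.

```javascript
// Hypothetical sketch; function and field names are assumptions.
function resolveEdgesByLabel(nodes, edgeRows) {
  const firstIdByLabel = new Map();
  for (const node of nodes) {
    // Keep only the first node seen for each label (no disambiguation).
    if (!firstIdByLabel.has(node.label)) firstIdByLabel.set(node.label, node.id);
  }
  const resolved = [];
  const unresolved = [];
  for (const row of edgeRows) {
    const source = firstIdByLabel.get(row.sourceLabel);
    const target = firstIdByLabel.get(row.targetLabel);
    if (source === undefined || target === undefined) unresolved.push(row);
    else resolved.push({ source, target, relationship: row.relationship });
  }
  return { resolved, unresolved };
}

const nodes = [
  { id: 1, label: "Tacitus" },
  { id: 2, label: "Agricola" },
  { id: 3, label: "Agricola" } // duplicate label: id 2 wins
];
const edgeRows = [
  { sourceLabel: "Tacitus", targetLabel: "Agricola", relationship: "wrote about" },
  { sourceLabel: "Tacitus", targetLabel: "Domitian", relationship: "criticized" }
];
const result = resolveEdgesByLabel(nodes, edgeRows);
console.log(result.resolved);          // one edge: source 1, target 2
console.log(result.unresolved.length); // 1 (no node labeled "Domitian")
```

With duplicate labels the edge silently attaches to the first match, so any cleanup is left to the existing edit/delete/merge features.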

benloh commented 2 years ago
  1. Actually A) and B) are relatively easy to do. C) is much harder because of the way the system is currently designed. We'd need to build a new framework for creating and loading a database AFTER the app has already started. It's possible it'll be more straightforward to build it as a new parameter when starting the app from the command line, e.g.: ./nc.js --import_nodes=tacitus_nodes.csv --import_edges=tacitus_edges.csv

If C) is the priority, then importing is much more complex.
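For reference, a minimal sketch of how the proposed startup parameters might be read. `--import_nodes` and `--import_edges` are only the flags floated above, not options that nc.js is known to support, and `parseImportArgs` is a hypothetical helper.

```javascript
// Sketch only: --import_nodes and --import_edges are the flags proposed above,
// not options that nc.js is known to support.
function parseImportArgs(argv) {
  const options = {};
  for (const arg of argv) {
    // Accept arguments of the form --name=value.
    const match = /^--([a-z_]+)=(.+)$/.exec(arg);
    if (match) options[match[1]] = match[2];
  }
  return {
    dataset: options.dataset,
    importNodes: options.import_nodes,
    importEdges: options.import_edges
  };
}

// In a real script the input would be process.argv.slice(2).
const parsed = parseImportArgs([
  "--dataset=tacitus",
  "--import_nodes=tacitus_nodes.csv",
  "--import_edges=tacitus_edges.csv"
]);
console.log(parsed);
```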

  2. One more question: Which fields are required and which are optional when importing? Right now we are blindly requiring ALL of the fields in the import data table in order to be valid:
// For Nodes
id,label,attributes:Node_Type,attributes:Extra Info,attributes:Notes,degrees,meta:created,meta:updated

// For Edges
id,source,target,attributes:Relationship,attributes:Info,attributes:Citations,attributes:Category,attributes:Notes,meta:created,meta:updated

For any given record, you can have empty fields, but we expect the table format to have all of these fields defined. I'm guessing you probably need more flexibility than that?
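A small sketch of the "all fields must be present" check, using the node header list above. `findMissingHeaders` is a hypothetical helper, and the naive comma split assumes header names contain no commas or quotes.

```javascript
// Required node headers, copied from the list above.
const REQUIRED_NODE_HEADERS = [
  "id", "label", "attributes:Node_Type", "attributes:Extra Info",
  "attributes:Notes", "degrees", "meta:created", "meta:updated"
];

// Hypothetical helper: returns the required headers that are absent.
function findMissingHeaders(headerLine, required) {
  // Naive split: assumes header names contain no commas or quotes.
  const present = new Set(headerLine.split(",").map(h => h.trim()));
  return required.filter(h => !present.has(h));
}

const full = "id,label,attributes:Node_Type,attributes:Extra Info,attributes:Notes,degrees,meta:created,meta:updated";
const partial = "id,label,attributes:Notes";

console.log(findMissingHeaders(full, REQUIRED_NODE_HEADERS));    // []
console.log(findMissingHeaders(partial, REQUIRED_NODE_HEADERS)); // the five missing headers
```

Making some fields optional would mostly mean shrinking the `required` list, so the same check could cover a more flexible policy.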

jdanish commented 2 years ago

If we use the nc-multiplex to create the new file does that make it easier? Not sure that fits Kalani’s use case but figured I’d ask.

benloh commented 2 years ago

If we added import to the regular nc.js startup script, we'd probably have to make a corresponding change with nc-multiplex to make it work. What might be slightly easier would be to figure out a way to initiate a new blank db with a new name, then allow the upload via the web interface (otherwise you'd have to have direct access to the server to upload files there and import files directly, which now that I think about it, sounds like a terrible solution).

kalanicraig commented 2 years ago

The issues of required vs. optional fields and template creation seem related.

If we assume that a re-import of data creates a new dataset, then the list of fields would be controlled by the exported file. Let’s require all fields to exist (even if the values are blank), and the documentation will note that, along with suggesting an export from a template.

If we treat option C as a new network, then it would essentially be a new network and a new template, but using existing values from an existing network for the template alone and then proceeding with the append option.

kalanicraig commented 2 years ago

In which case, we’d need to rethink my (bad) idea about using node labels instead of node IDs for edge imports.

benloh commented 2 years ago

@kalanicraig This is getting complicated. It sounds like we should at least take a pass through the template editing design before we fully address this. But is this correct:

Main Use Model

  1. Export nodes and edges from an existing database
  2. Modify them externally (e.g. in Excel or another application)
  3. Re-import nodes and edges, replacing existing nodes and edges, and adding new ones

Secondary Use Model

  1. Create new nodes and edges in another application or Excel
  2. Import the nodes and edges into an existing database, replacing any existing nodes and edges

Tertiary Use Model

  1. Create new nodes and edges file either by exporting or creating in another application or Excel
  2. Create, copy, or edit a template
  3. Create a new database with the new template, node file, and edge file.

In all cases, I imagine it might be useful to have a Dry Run feature where you can test the import and get a report that lists the nodes and edges that would be added or replaced? If you like the Dry Run, you can then press Import to do the actual import?
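One possible shape for the Dry Run idea, as a sketch: the report is computed without touching the existing records, and a separate step commits the import. `dryRunImport` and `commitImport` are hypothetical names, and matching on `id` is an assumption.

```javascript
// Hypothetical sketch: report what an import WOULD do, without changing anything.
function dryRunImport(existing, imported) {
  const existingIds = new Set(existing.map(rec => rec.id));
  const report = { added: [], replaced: [] };
  for (const rec of imported) {
    (existingIds.has(rec.id) ? report.replaced : report.added).push(rec.id);
  }
  return report;
}

// The actual import: imported records replace existing ones with the same id.
function commitImport(existing, imported) {
  const byId = new Map(existing.map(rec => [rec.id, rec]));
  for (const rec of imported) byId.set(rec.id, rec);
  return [...byId.values()];
}

const existing = [{ id: 1, label: "Tacitus" }, { id: 2, label: "Agricola" }];
const imported = [{ id: 2, label: "Gnaeus Julius Agricola" }, { id: 3, label: "Domitian" }];

const report = dryRunImport(existing, imported);
console.log(report); // { added: [ 3 ], replaced: [ 2 ] }

// Only after the user reviews the report and presses Import:
const merged = commitImport(existing, imported);
console.log(merged.length); // 3
```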

jdanish commented 2 years ago

I'll defer to Kalani but wanted to clarify: if we have 2 and 3 from the tertiary model, then really the only difference between the tertiary and secondary would be doing it "all at once" in which case I think we can drop the tertiary? Or am I missing something? Thanks!

benloh commented 2 years ago

I think the main difference in the tertiary is the addition of the template file and not modifying an existing database. Part of the reason I'm teasing these all out is to make sure that the workflow is supported by whatever scheme we come up with, especially if they require slightly different methods (e.g. creating a new db), and biasing the design towards one model vs another (e.g. if you only rarely do the tertiary model, then it's OK if it's a little more difficult to do).

kalanicraig commented 2 years ago

You’re right about the main difference, and I think that makes the tertiary model mostly unnecessary. We can handle it with documentation: create a new blank DB with the template creation and then work with the secondary model for import/export (it’s just that the export would be blank).

benloh commented 2 years ago

@kalanicraig Just confirming then, that with the emphasis on importing for the Feb 2022 pilots/tests, we want to prioritize:

  1. Importing nodes/edges to an existing database

The node/edge data might be created from scratch or created by first exporting existing nodes/edges. (e.g. Main Use Model and Secondary Use Model, above)

  2. Importing nodes/edges to a NEW database

It sounds like this is not as urgent and can be handled via other existing means for editing templates and creating new databases. (e.g. Tertiary Use Model above). While this would be a nice addition, it requires a substantial amount of rework of both netcreate and nc-multiplex.

kalanicraig commented 2 years ago

Correct.

benloh commented 2 years ago

@kalanicraig @jdanish One more question about importing:

What kind of restrictions should we place on importing?

On a related note, I was assuming that we don't need to place similar restrictions on exporting. But perhaps you do want a way to lock a database too so that people can't arbitrarily export data?

jdanish commented 2 years ago

If login is required to see the network, doesn't that mean you can't get to the import tab without logging in? Either way, let's say you need to be logged in and be in admin mode to import. My inclination is to say that if you can see the network, you can export it.

Also, by the way, what are you using to edit the CSV files in your testing? The reason I ask is that we did a quick test and Excel appears to cause problems. Literally opening and saving a CSV in Excel seems to break the import, even without intentionally editing anything.

benloh commented 2 years ago

> If login is required to see the network, doesn't that mean you can't get to the import tab without logging in?

For a project that requires login, yes, you wouldn't see the tab. But for a project that doesn't require login, you see the tab immediately. However, we can still restrict it so that you have to log in before you can import. We would just add an extra level of hiding: e.g. if you're not logged in, the import buttons are grayed out or missing.

So for example, you could allow users to import if they're logged in even if they are NOT admins.

> what are you using to edit the csv files in your testing?

Export was broken -- it was not properly accounting for missing data, so the fields were getting shifted. It's fixed in the latest branch (import), but there's lots of other stuff that is still broken.
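To illustrate the kind of bug described here (this is a sketch, not the actual fix on the import branch): if a missing value is skipped instead of written as an empty cell, every later field shifts left. `toCsvRow` is a hypothetical helper.

```javascript
// Hypothetical sketch of a row serializer that keeps columns aligned.
function toCsvRow(record, headers) {
  return headers
    .map(header => {
      const value = record[header];
      // Missing values become empty cells instead of being skipped.
      const text = value === undefined || value === null ? "" : String(value);
      // Quote cells containing commas, quotes, or line breaks (RFC 4180 style).
      return /[",\r\n]/.test(text) ? `"${text.replace(/"/g, '""')}"` : text;
    })
    .join(",");
}

const headers = ["id", "label", "attributes:Notes", "degrees"];
console.log(toCsvRow({ id: 1, label: "Tacitus", degrees: 2 }, headers));
// 1,Tacitus,,2
console.log(toCsvRow({ id: 2, label: "Agricola, Gnaeus", degrees: 0 }, headers));
// 2,"Agricola, Gnaeus",,0
```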

I sometimes open the file directly in VSCode, other times I use Numbers and Excel. But if you look at the exported data directly in VSCode and count the number of data points vs the headers, you'll probably find that you're missing a few data points -- that's causing the data corruption.
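The manual count can also be scripted. A quick sketch, assuming a simple CSV with no quoted commas or embedded newlines (a real check would need a proper CSV parser); `findShiftedRows` is a hypothetical name.

```javascript
// Hypothetical sketch: flag rows whose field count differs from the header's.
function findShiftedRows(csvText) {
  // Naive split: assumes no quoted commas or embedded newlines.
  const lines = csvText.split(/\r?\n/).filter(line => line.length > 0);
  const expected = lines[0].split(",").length;
  const problems = [];
  lines.slice(1).forEach((line, index) => {
    const found = line.split(",").length;
    // index 0 is the first data row, which is line 2 of the file.
    if (found !== expected) problems.push({ row: index + 2, expected, found });
  });
  return problems;
}

const sample = "id,label,attributes:Notes,degrees\n1,Tacitus,,2\n2,Agricola,1\n";
console.log(findShiftedRows(sample)); // [ { row: 3, expected: 4, found: 3 } ]
```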

jdanish commented 2 years ago

I’d say let’s require login as described.

And cool, we will test again when you tell us to.

Thanks!


benloh commented 2 years ago

Sorry, thinking this through some more, I can see a situation where you don't want any old user doing imports: e.g. it's the first class with 50 students, everyone's working on a shared network. You don't want some wiseass to clobber the whole network.

But later on you might want to allow students to import mini-networks.

This suggests that we add a Template option allowImport. By default, it's false and only admins can import. If it's true, then anyone logged in can import. For a network that does not require login, the Import section is hidden or grayed out.
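A sketch of how that rule might read in code. `allowImport` is the template option proposed above; `isLoggedIn`, `isAdmin`, and `canImport` are hypothetical names. It also assumes that on a network that does not require login, import stays hidden until the user logs in, which is one reading of the proposal.

```javascript
// Hypothetical sketch of the proposed allowImport rule.
function canImport(template, user) {
  // Not logged in: import is hidden or grayed out, whatever the template says.
  if (!user || !user.isLoggedIn) return false;
  // Admins can always import.
  if (user.isAdmin) return true;
  // Everyone else needs the template to opt in; the default is false.
  return template.allowImport === true;
}

const lockedTemplate = {};                  // allowImport defaults to false
const openTemplate = { allowImport: true };
const student = { isLoggedIn: true, isAdmin: false };
const admin = { isLoggedIn: true, isAdmin: true };

console.log(canImport(lockedTemplate, student)); // false
console.log(canImport(lockedTemplate, admin));   // true
console.log(canImport(openTemplate, student));   // true
console.log(canImport(openTemplate, null));      // false
```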

Or maybe I'm overthinking it?

kalanicraig commented 2 years ago

I like this idea a lot. It lets admins turn import on once they have a sense of whether imports are a good or bad idea, and it sets up some control without making it too complicated.
