Open GoogleCodeExporter opened 8 years ago
That sounds more like an enhancement request than a defect report.
To generalize things a bit, support for a standard data access API would allow
people
to plug in multiple DB backends.
Original comment by tfmorris
on 10 May 2010 at 7:26
One of the major challenges here is how to support undo/redo when changes can
get into
the back-end database without going through Gridworks. Another major challenge
is where
to store metadata (such as reconciliation records) that is specific to
Gridworks and
not native to any existing back-end database.
Original comment by dfhu...@gmail.com
on 10 May 2010 at 7:32
Hmm...I presume a new project would have to be created when doing this, and that
could hold the meta information.
As for undo/redo, maybe adding a commit button would make that easier. So
changes can
be made to a snapshot of the data, but then no changes are made to the DB itself
until commit is pressed?
Original comment by mjlissner
on 10 May 2010 at 7:36
You can either keep it synchronous with the database (effectively using the
database
as the backend), but lose undo/redo support and reconciliation with Freebase
(unless
you add suitable tables to the database). You're then using Gridworks for just
the
facets really.
The other way, as you suggest, is to make it similar to a disconnected session
with a
commit transaction. The issue is then ensuring consistency between the remote
database and the snapshot held in Gridworks. Merging the two back in would be
an
issue. You'd also need to hold keys from the remote database in Gridworks for
updating records.
Frameworks such as Hibernate or Spring would be worth considering for their
database
abstraction layers.
Original comment by iainsproat
on 10 May 2010 at 7:46
I have read that good ORMs such as Hibernate now support ordered lists and
other
features now incorporated into JPA 2.0 as of Dec 2009.
Original comment by thadguidry
on 10 May 2010 at 8:05
Original comment by iainsproat
on 25 May 2010 at 7:59
Original comment by iainsproat
on 14 Oct 2010 at 9:26
Initially, what I believe is most important/useful, is simply having the
ability to direct-connect to MySQL/PostgreSQL/etc. data sources from the
get-go(create a new project). Also initially, being able to set/save multiple
data sources, and multiple dBs within those sources. When the user creates a
new project, this would in effect create a disconnected session(as mentioned
above), wherein the data is treated as is now the data. Adding 'commit'
features can come next, followed by more advanced connectivity options, until
such a point synchronous functionality is in place to one degree or another.
But for now, it would certainly be nice to add data sources and pull data from
those sources!
Eric Jarvies
Original comment by 7...@ericjarvies.com
on 11 Nov 2010 at 7:08
Eric, A cleanup tool using industry best practices is best used offline within
a process. There are existing ETL tools that easily consume from
MySQL/PostgreSQL etc and offer excellent flow control, exporting, and
connectivity to produce delimited files with relative ease. Talend is one such
product that I use along with Google Refine. Talend (Open Source Community
edition) does my scheduled daily gathering from 3 databases (MySQL and Oracle)
and then dumps a customized TSV file that I open with Google Refine for further
analysis and sometimes clean up. There are other tools that provide ETL
(Extract, Transform, Load) like Talend. I'm not 100% sure if the team really
feels the need to copy that and flesh out a full ETL platform, since Talend and
other tools fill that need very nicely. Incidentally, using Google Refine and
a bit of clustering, I was able to find a few loop holes in our data storage
processing that we fixed with a few stored procedures within Oracle. Google
Refine was instrumental as a discovery tool for that. Talend does have an MDM
component but does not have the interactivity of a discovery tool like Google
Refine does. If you do NOT need a daily process, but only one time cleanup,
just dumping with MySQL or PostgreSQL would offer about the same and depending
on the size of database takes only secs to minutes. Dumping can also avoid
potential live database locks, that if Refine supported might have to tip-toe
around, depending on the teams' chosen implementation of database connectivity.
If you have large database size needs, give Talend or another ETL tool a try
with Google Refine, and you'll soon see the powerful left-right combination.
I'm not sure how far the team ultimately decides to absorb direct connectivity
support within Google Refine. I'd like to hear other opinions as well on this
Issue-12.
Original comment by thadguidry
on 11 Nov 2010 at 2:36
I agree with thaguidry. Let the Refine team focus on bringing data quality
issues to light. Let Talend focus on data Quality (they do have an data
profiling tool that can identify some of this stuff) Talend is what we use for
basic ETL. You could write some SQL to get the data out of MySQL anyway. If you
analyze directly connected to db server for data quality against an entire
large table your dba might become angry too.
C
Original comment by Chris.Go...@gmail.com
on 14 Apr 2011 at 2:23
Original issue reported on code.google.com by
mjlissner
on 10 May 2010 at 5:39