Open ninoppp opened 2 years ago
Hey @ninoppp - thanks for raising this issue! Tagging my colleagues @itsibitzi and @joelochlann who should be able to help with some of the differences between Aleph and Datashare as they've got more experience of using these other platforms and might have a better idea of the pros/cons of each vs giant.
In terms of whether I'd recommend giant to a small media organisation, the main thing is that I think you would be the first external organisation to use giant. This would be exciting for us and we'd love to help you get it set up - but it may not be a smooth journey! You may find the Aleph/Datashare docs are more complete - in particular Aleph has been open source from the beginning as far as I know so there's a greater chance their documentation is tried/tested by external users. From a technical perspective, Giant currently uses both Neo4j and Elasticsearch databases which can be a challenge to manage - though you can use managed versions of these too to reduce the maintenance effort.
Giant was originally designed with a focus on being able to ingest data as fast as possible. It has an ingestion pipeline that can scale horizontally in order to boost performance. This was based on the Guardian's experience of other investigations where getting data in front of journalists as fast as possible was essential.
One defining feature of giant which is coming (very) soon hopefully is the ability to search for text and then view the text highlighted in place - see https://github.com/guardian/giant/pull/38 - with Aleph/Datashare my understanding is that it is common to need to download documents rather than viewing/searching within the platform.
Re the 'platform for investigations' question - this is the only open source tool in the suite...for now! Originally giant was called 'pfi' with the idea that it would include multiple different tools, whereas now it is focussed on searching/sharing documents securely.
Both Aleph and Datashare have versions you can quickly try out I think! For giant we could potentially give you a demo on a zoom call at some point if you're interested.
Thanks a lot for the response @philmcmahon !
I see the point about being the first ones to adopt it. However, if the software's good and you are motivated to help us every now and then I think it would be worth it.
Actually, my biggest concern: You seem to have AWS S3 baked into the software. Using an external service, especially Amazon, wouldn't really be compatible with our OpSec model. How difficult would it be to use it with local storage only? And, since we're probably starting out with somewhat limited hardware (24 cores, 64GB RAM, main storage on HDDs), would it be feasible performance-wise?
The text highlighting indeed isn't present in the other options - Aleph doesn't have it at all (I think) and Datashare only has it for the extracted text, not inside the PDF/whatever. Quite fancy :)
I've already taken a look at Aleph's and Datashare's demos. Will probably do some testing on our production hardware once that's in place (a few weeks). Was planning to try to set up Giant in a VM, but a virtual demo would of course make things a lot easier :) Maybe we could continue this conversation over email.
this is the only open source tool in the suite...for now
Are there any concrete plans on releasing other tools? If they have compatibility benefits with Giant that would of course make it an even more interesting option...
Hey @ninoppp I've sent you an email, but I'll answer some of your other questions here in case it's useful for others.
How difficult would it be to use it with local storage only?
Definitely possible - we use https://min.io/ when running giant locally. Whilst we've been mostly running giant in AWS recently it was always designed with the idea that it should work offline.
Are there any concrete plans on releasing other tools?
Sorry for the air of mystery there. Right now we don't have any plans. Our team is growing at the moment though so giant development should pick up a bit after a fair while leaving it untouched.
Hi guys,
Pierre from ICIJ here!
Wonderful news to see you opened the source code of Giant on Github.
As suggested in this thread, it would be cool to establish a comparison matrix to help our communities choose a solution.
That might be a joint effort with our friends from Aleph (cc @Rosencrantz).
WDYT?
Thanks @ninoppp for raising the issue :)
Hi @pirhoo @philmcmahon
This sounds like an excellent idea. Some sort of side by side comparison would, I think, be worthwhile to help people make an informed decision on the right platform for their organisation.
@ninoppp Just to clarify a couple of points. Aleph does support/provide search term highlighting and document search without the need to download documents. If you'd like to learn more about using/running aleph we can always add you to our community slack channel.
:)
Hey @pirhoo @Rosencrantz sounds good! Should we have a quick call some time to work out what format it should be in? I guess we could just list the features and work out where there's overlap.
A key thing with giant is that nobody outside the Guardian has tried to run it (yet!) so we'll need a bit of a content warning.
Hi guys, what about a call on Friday 13th? Let's say 2pm London time?
My team and I are in the Paris timezone :)
Perfect timing... except I'm going to be on the way back from a meeting in Sarajevo and will be offline, but is there a chance that we'll all be at dataharvest together?
Very nice to see all of you connecting here :) In case you need a third party for something, feel free to hit me up.
Perfect timing... except I'm going to be on the way back from a meeting in Sarajevo and will be offline, but is there a chance that we'll all be at dataharvest together?
I'll be at Dataharvest too :)
Hey, sorry for the slow reply! Sadly we (Guardian) won't be at Dataharvest :( Could we meet when you get back? We're on London time
Absolutely! Sounds like a plan. How about the 26th of May?
I can't make it on the 26th. Maybe the 25th?
So @Rosencrantz and I met at Dataharvest, we partied a little too much and talked about everything but this very specific topic! I'm on my way for some time off until June 8. Maybe we can plan a meeting after that?
Doh I'm not doing a great job of keeping up with this - how about 9th June?
Could work for me!
Hey there
I'm currently checking out different platforms for investigations to set up for a small organization. While we're doing local testing, I'd also like to ask you directly:
What are the main differences between Giant and the two main alternatives? Which features do the others not have, which ones is Giant missing? Why are you maintaining Giant instead of using Datashare or Aleph? Would you recommend it to a small media organization?
Thanks for bothering :)
Edit: another question: You mention a Platform for Investigations suite. I did not find any other tools belonging to that suite publicly, is Giant the only open source one?