Open nevrome opened 1 year ago
roughly, yes. I was only looking for the original file, maybe with a version attached, rather than a complete reconstruction of the forge history of an individual (which I fear might look a bit messy after 3-4 forges). That said, this approach would more than cover my suggestion so 👍 from me conceptually.
Yes, that's great.
What I had in mind was mirroring Thiseas idea, which was just to record this the very first time, so that only the very original source packages are recorded. This could simply be implemented by forge checking if the column Source_Packages
is already present in a row, and if so, do not alter it.
The chain is a pretty cool idea. I am worried that it might become unwieldy though if people do many derived forges. Or is that just a too unlikely scenario?
Another disadvantage of the chain syntax is that we cannot simply use this column to filter on a given source package or so.
How about we just drop the very last arrow? Then we would get after the first forge
Poseidon_ID ... Source_Packages
Ind1 ... PacWithInd1
Ind2 ... PacWithInd2
which would then at least for the 90% of cases where there is only one forge make the output very easily understandable and filterable?
Yes - that sounds like a good compromise.
The arrow doesn't fit too well to our other columns anyway. We should probably just work with a list column, like the ones we already have. So the first forge would produce
Poseidon_ID ... Source_Packages
Ind1 ... PacWithInd1-1.0.1
Ind2 ... PacWithInd2-2.2.4
and the second one
Poseidon_ID ... Source_Packages
Ind1 ... PacWithInd1-1.0.1;outPac-0.1.0
Ind2 ... PacWithInd2-2.2.4;outPac-0.1.0
Which reminds me that we should consolidate how exactly we want to pair package names and package versions in one string, @stschiff. You came up with something like this for #234 already, but maybe we should think about it again to make sure it is future-proof.
Might be getting a bit more complicated, but I think something like PacWithInd1{1.0.1};outPac{0.1.0}
might be more easy to parse downstream?
Yeah, we discussed this with Clemens. I think most languages/repos actually use the hyphon-version (at least that's what I see in python and Haskell I think). What about R?
I would prefer the simple hyphenated version, tbh.
In that case, do we have a restriction that package names cannot contain hyphens?
No we do not need that. It's still parsable, because you just have to make a regex that if a package ends with something like -[0-9]+\.[0-9]+\.[0-9]+\.([0-9]+){0,1}
then that's the version.
I am really scared of using curly brackets to be part of paths/file_names/URLs (in the end we need to think that in the future we might want to use them also in URLs). I think that's not really an option. I think such names should be restricted to alphanumerics, dots and hyphens.
This was raised a couple of months ago by @TCLamnidis and I thought I quickly formalize it into an issue. Let me know if there already is an issue about that in some of our repositories.
The idea is to add an option to forge to document forge operations on a sample/individual level in the .janno file, specifically with a new column that documents the source package the individual is extracted from.
That means a forge operation like this
should create a package
outPac
with a .janno file looking maybe something like this:The new column
Source_Packages
does not necessarily have to be included into the Poseidon schema -- we can discuss that. Also the name and the syntax are up for discussion.If a package with this column is forged again, then we could simply append the next step in the column, so for example like this:
Is this what you had in mind, Thiseas?