refgenie / refgenconf

A Python object for standardized reference genome assets.
http://refgenie.databio.org
BSD 2-Clause "Simplified" License
3 stars 6 forks source link

Dev config upgrade #111

Closed xuebingjie1990 closed 3 years ago

xuebingjie1990 commented 4 years ago

create config_upgrade function in RefGenConf:

updated the CONF_STRUCTURE in const.py

Resolve https://github.com/refgenie/refgenie/issues/203

xuebingjie1990 commented 4 years ago
  • make sure the internal class cannot be imported

if this class cannot be imported, how can I set up the cmd in refgenie? refgenie upgrade would need _RefGenConfV03 to create the obj.

stolarczyk commented 4 years ago

Maybe that's the reason why the upgrade method should be bound to current RefGenConf, or a standalone function?

In fact, this makes more sense that a new class consists of the information needed to lift the version over from an older one to itself, rather than supplying this information (regarding future updates) to the previous class. What is more, this way it could be really just copy and paste every time we update config versions.

stolarczyk commented 4 years ago

I'll write a docs page about config reformatting and the new refgenie upgrade functionality.

nsheff commented 4 years ago

I have one small suggestion: should the function be upgrade_config instead of config_upgrade ? This is going along with the mantra that method names should be verbs. I find it a bit easier to follow that way.

xuebingjie1990 commented 4 years ago

could you add some logging

I added more logging. here are the current output. Since I tested on the plant server, all logging shows the digest is from the local fasta file:

Upgrade the v0.3 config file format to v0.4 format.
Loaded AnnotatedSequenceDigestList (1 sequences)
Generate rCRSd genome digest from local fasta file.
Loaded AnnotatedSequenceDigestList (2385 sequences)
Generate hs38d1 genome digest from local fasta file.
Loaded AnnotatedSequenceDigestList (1 sequences)
Generate rCRS genome digest from local fasta file.
Loaded AnnotatedSequenceDigestList (60 sequences)
Generate hg38_chr22 genome digest from local fasta file.
Loaded AnnotatedSequenceDigestList (195 sequences)
Generate hg38 genome digest from local fasta file.
Loaded AnnotatedSequenceDigestList (955 sequences)
Generate rn6 genome digest from local fasta file.
Upgrade '/home/bx2ur/genome_folder' structure.
Create 'data' and 'alias' directories inside '/home/bx2ur/genome_folder'.
Copy genome assets to 'data'. Genomes failed to retrieve genome digest will be ignored.
Use genome digests as identifiers in 'data' directory.  Create symbolic links in 'alias' directory.
Remove genome assets that have been copied to 'data' directory.
nsheff commented 4 years ago

I'm confused about something, maybe it's simple. How are you testing this? I have tried with the correct branches of refgenie and refgen conf and I see this:

refgenie upgrade --target-version 0.4
Upgrading config to v0.4. Current genome identifiers will be replaced with sequence-derived digests and contents inside '/home/nsheff/code/refgenie_sandbox' will be replaced by 'data' and 'alias' directories. For more info visit: http://refgenie.databio.org/. Would you like to proceed? [y/N] y
Upgrade the v0.3 config file format to v0.4 format.
'http://refgenomes.databio.org' is not a compatible refgenieserver instance. Could not determine API endpoint defined by ID: 'v3custom_Id_alias_digest'

I guess that means the upgrade didn't happen (should probably change the note accordingly)...but can it compute the digests if it can't retrieve them? Otherwise, do you require a compliant refgenieserver to test? Then how are you testing since our current server is not compliant?

Separately, I think the next step will be to release a refgenieserver update, then we can do these kinds of tests, right? Such a server would be compatible with both versions, correct? Where are we on making that release?

stolarczyk commented 4 years ago

I've tested this once with a local, 0.4-compatible refgenieserver instance. Yes, it's kind of a pain that you can't test this.

And you're right, I haven't noticed this: the upgrade should be successfully performed for genomes with fasta assets available even if none of the servers the user subscribed to is compatible.

xuebingjie1990 commented 4 years ago

I tested with a local instance and the plant server that is 0.4-compatible. I'll work on upgrade with no compatible servers subscribed.

stolarczyk commented 4 years ago

For the refgenieserver release, since it depends on refgenconf it would be nice to release refgenconf first. We could maybe deploy an instance of refgenieserver that uses the new stack and then test the upgrading?

Such a server would be compatible with both versions, correct?

That was an API version bump so the old API endpoints did not change -- it's compatible, I made the old version compatible with the new config. However, I just realized that there is a problem with the data that they need to serve. We would need to keep two copies of everything, one current and one that follows the old standard. It would be hard to preprocess the data on-the-fly (for backwards compat) since it's archived...

nsheff commented 4 years ago

We would need to keep two copies of everything, one current and one that follows the old standard

but what changed about the data? didn't we just change the identifiers and config format?

nsheff commented 4 years ago

I tested with a local instance and the plant server that is 0.4-compatible. I'll work on upgrade with no compatible servers subscribed.

but how did you test the human genome duplicates on rivanna then?

stolarczyk commented 4 years ago

but what changed about the data? didn't we just change the identifiers and config format?

The files within assets are not associated with any human-readable genome id. So a fasta file within hg38/fasta asset that would be served to the old client would not be named hg38.fa but <digest>.fa. This might pose some issues for the client file handling

xuebingjie1990 commented 4 years ago

but how did you test the human genome duplicates on rivanna then?

I changed the server 'http://refgenomes.databio.org' to the plant server 'http://refgenomes.databio.org:84/' in the config file just for testing. Since Michal said the 'http://refgenomes.databio.org' will be the 0.4-compatible server, I didn't take into consideration whether the server is compatible, just whether the genome is in the subscribed server or not.

xuebingjie1990 commented 4 years ago

So a fasta file within hg38/fasta asset that would be served to the old client would not be named hg38.fa but .fa.

but the genome_folder/alias still have the same naming as the old version, and files are linked to the files in genome_folder/data. if the fasta file within hg38/fasta asset that served to the old client sees genome_folder/alias as its genome_folder, should it still find the right file using hg38.fa?

stolarczyk commented 4 years ago

yeah, that's one option if this works. I'll try to make it work today

nsheff commented 4 years ago

I changed the server 'http://refgenomes.databio.org' to the plant server 'http://refgenomes.databio.org:84/' in the config file just for testing. Since Michal said the 'http://refgenomes.databio.org' will be the 0.4-compatible server, I didn't take into consideration whether the server is compatible, just whether the genome is in the subscribed server or not.

and it doesn't matter that there was no overlap in which genomes the server offered? So you're saying it's only an issue if there's no server, the server doesn't actually have to provide information.

stolarczyk commented 4 years ago

but the genome_folder/alias still have the same naming as the old version, and files are linked to the files in genome_folder/data. if the fasta file within hg38/fasta asset that served to the old client sees genome_folder/alias as its genome_folder, should it still find the right file using hg38.fa?

well, actually no.. I forgot we're not talking about the local asset collection, but served ones. The assets are archived and we serve these archives, so this is not an option. But it might be possible to make changes to the archiver to make this work, which was my initial idea.

xuebingjie1990 commented 4 years ago

and it doesn't matter that there was no overlap in which genomes the server offered? So you're saying it's only an issue if there's no server, the server doesn't actually have to provide information.

when there is no overlap or when there is no server, it will generate the digest from the local fasta. if there is no local fasta asset either, it will prompt the user that those genome will be dropped from the config.

the issue is, when there is server but not compatible, it will stop the upgrade.

xuebingjie1990 commented 4 years ago

Now upgrade_config should work with non-compatible server, or list of servers that is a mix of compatible and non-compatible server.

I added a step to check the list of subscribed servers by get the url of the index page. It will log, but not exit, if none of the subscribed server is compatible. I'm not sure this is the best way to do it. I just think we should let the users know.

stolarczyk commented 4 years ago

http://rg.databio.org:82/

@xuebingjie1990, you can use the newly-deployed test instance which serves assets formatted the new way to test this feature. It serves just couple of genomes, so it will be a perfect test case for the function as it would need to perform the upgrade in various ways, depending on particular genome availability.

xuebingjie1990 commented 4 years ago

@xuebingjie1990, you can use the newly-deployed test instance which serves assets formatted the new way to test this feature.

Done, and fixed a bug.

codecov[bot] commented 4 years ago

Codecov Report

Merging #111 into dev will decrease coverage by 12.70%. The diff coverage is 61.05%.

Impacted file tree graph

@@             Coverage Diff             @@
##              dev     #111       +/-   ##
===========================================
- Coverage   81.19%   68.49%   -12.71%     
===========================================
  Files          26       31        +5     
  Lines        1569     3345     +1776     
===========================================
+ Hits         1274     2291     +1017     
- Misses        295     1054      +759     
Impacted Files Coverage Δ
refgenconf/_version.py 100.00% <ø> (ø)
refgenconf/exceptions.py 100.00% <ø> (+5.55%) :arrow_up:
refgenconf/refgenconf.py 67.82% <ø> (-1.66%) :arrow_down:
setup.py 0.00% <ø> (ø)
refgenconf/progress_bar.py 38.23% <38.23%> (ø)
refgenconf/refgenconf_v03.py 38.35% <38.35%> (ø)
refgenconf/henge.py 69.16% <69.16%> (ø)
refgenconf/seqcol.py 93.18% <93.18%> (ø)
refgenconf/helpers.py 91.24% <93.69%> (+9.75%) :arrow_up:
tests/test_upgrade.py 97.87% <97.87%> (ø)
... and 29 more

Continue to review full report at Codecov.

Legend - Click here to learn more Δ = absolute <relative> (impact), ø = not affected, ? = missing data Powered by Codecov. Last update 46745c2...a06f9cb. Read the comment docs.

stolarczyk commented 4 years ago

@xuebingjie1990 are you planning to update the lab config file anytime soon or do you want me to do this?

xuebingjie1990 commented 4 years ago

@xuebingjie1990 are you planning to update the lab config file anytime soon or do you want me to do this?

sorry, I just saw this...I just realized another issue. There are some files that are just symbolic links in the lab genome folder. That need to update with the new file structure. I'll work on that tomorrow.

xuebingjie1990 commented 4 years ago

Is this how you did set it up?

The way it works only update the links inside genome_folder. It will replace /genome_folder/ with /genome_folder/data/ also alias to digest in the file name. So if it points to files outside the genome_folder the src path would no change.

The code it's around line 190 to 198 in helper.py, if you want to make sure.

Also, I might no be able to make the upgrade on lab config any time soon. My internet is down. I'm on my phone right now.

stolarczyk commented 4 years ago

The way it works only update the links inside genome_folder. It will replace /genome_folder/ with /genome_folder/data/ also alias to digest in the file name. So if it points to files outside the genome_folder the src path would no change.

I referred to the targets of the symbolic links in genome_folder in my comment. Not sure if that's clear.

xuebingjie1990 commented 4 years ago

I referred to the targets of the symbolic links in genome_folder in my comment. Not sure if that's clear.

The target is the file that is a link right? The old target will be removed before create a new link to the correct source file.

stolarczyk commented 4 years ago

The target is the file that is a link right?

No. The target is the file the symbolic link points to.

Here's a visualization of the problem I'm concerned about:

[mstolarczyk@MichalsMBP testing]: tree /Users/mstolarczyk/Desktop/testing/testlinks
/Users/mstolarczyk/Desktop/testing/testlinks
├── dir
│   └── dir1
│       └── lnk.txt -> /Users/mstolarczyk/Desktop/testing/testlinks/src.txt
└── src.txt

2 directories, 2 files
[mstolarczyk@MichalsMBP testing]: ipy
Python 3.6.5 (v3.6.5:f59c0932b4, Mar 28 2018, 05:52:31) 
Type 'copyright', 'credits' or 'license' for more information
IPython 7.5.0 -- An enhanced Interactive Python. Type '?' for help.

autoreload startup script has been executed!

In [1]: import shutil                                                                                                                                                                                     

In [2]: shutil.copytree("/Users/mstolarczyk/Desktop/testing/testlinks", "/Users/mstolarczyk/Desktop/testlinks_cpy", symlinks=True)                                                                        
Out[2]: '/Users/mstolarczyk/Desktop/testlinks_cpy'

In [3]: quit                                                                                                                                                                                              
[mstolarczyk@MichalsMBP testing]: tree /Users/mstolarczyk/Desktop/testlinks_cpy
/Users/mstolarczyk/Desktop/testlinks_cpy
├── dir
│   └── dir1
│       └── lnk.txt -> /Users/mstolarczyk/Desktop/testing/testlinks/src.txt *
└── src.txt

2 directories, 2 files

File marked with * above will become a dangling symlink since /Users/mstolarczyk/Desktop/testing/testlinks/src.txt will be removed in the next step.

Again, I'm not saying you didn't do this right. I didn't test it ;) Just wanted to bring your attention to this issue.

xuebingjie1990 commented 4 years ago

Just wanted to bring your attention to this issue.

ok, I just the terminology mixed up. But, yes, the target is being updated to /Users/mstolarczyk/Desktop/testlinks_cpy/src.txt in your example. But if the target is /Users/mstolarczyk/Desktop/some/other/path/src.txt, then there is no change, and will become a dangling symlink if that target is deleted.

stolarczyk commented 4 years ago

ok, great. That's the approach I suggested here: https://github.com/refgenie/refgenconf/pull/111#pullrequestreview-515659362