tbepler / topaz

Pipeline for particle picking in cryo-electron microscopy images using convolutional neural networks trained from positive and unlabeled examples. Also featuring micrograph and tomogram denoising with DNNs.
GNU General Public License v3.0

Better RELION integration #56

Open biochem-fan opened 4 years ago

biochem-fan commented 4 years ago

@scheres and I are interested in better RELION integration of Topaz.

Several things we wish are:

Some of these can be implemented outside Topaz as a separate converter or a wrapper, but I think it is more efficient to have them inside Topaz itself. For example, a wrapper can make a new working directory, make symbolic links to the relevant files, and call Topaz, but this can easily get messy.

@alexjnoble Are you working on any of them? (I saw your tweet: https://twitter.com/alexjamesnoble/status/1267000205838364673) If you are too busy to work on them, I can try myself and send a pull request. Do you have something you don't want to have inside Topaz?

alexjnoble commented 4 years ago

Hi Takanori,

Yes, we have basic Topaz-Relion integration wrappers for denoising and picking, provided by a contributor, that we are going to add once we have tested them. Once we add them to the Topaz repository, we will let you know so that you can play with them and improve them as you wish.

Best, -Alex

tbepler commented 4 years ago

In addition to what Alex said about RELION wrappers already in the works, I am happy to accept pull requests implementing most of these features as long as they do not change the default topaz interface/behavior. My thoughts on your specific feature requests:

  1. Take a list of micrographs as a file (Issue #47)
    • As I said in the other issue, I'm happy to accept a pull request with this feature for commands that currently only accept micrographs on the command line. Notably topaz extract. The standard unix way to handle this is to allow these commands to read file paths from stdin which has other pipelining advantages. For example, topaz extract < micrograph_paths.txt.
  2. Split star files.
    • Splitting star files can be achieved with the topaz split command, but it writes these to a target output directory. I would accept a pull request implementing individual star files being written to the same location as their corresponding micrographs as an optional argument for extract. This seems like a useful feature.
  3. Respecting directory structure.
    • I'm not sure which specific commands you are referring to, but this is a trickier problem than it appears on the surface. The image list file option for topaz train resolves this problem by explicitly connecting image names with file paths. For commands that write outputs based on image names to a target output directory and therefore overwrite duplicated names, the challenge is to specify how many directory levels are important. At the limit, you would need to duplicate the entire directory structure starting from root. I would accept a workaround option that writes outputs to the same location as the inputs with some output suffix. This would probably resolve the problem in most cases.
  4. Processing only new files
    • This functionality is better left outside topaz. Pipelining tools like make are designed specifically to handle all of the complexities that come with implementing this well. I think simple and predictable is better in this regard.
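
The stdin convention from item 1 could be sketched in Python roughly as follows. This only illustrates the idea and is not Topaz's actual implementation; the helper name `read_micrograph_paths` is hypothetical.

```python
import sys


def read_micrograph_paths(argv_paths, stream=None):
    """Return micrograph paths from the command line, or from a stream if none were given.

    Hypothetical helper for illustration; not part of the Topaz code base.
    """
    if argv_paths:
        return list(argv_paths)
    if stream is None:
        stream = sys.stdin
    # One path per line; blank lines and surrounding whitespace are ignored.
    return [line.strip() for line in stream if line.strip()]
```

With a helper like this, a command invoked without positional paths would fall back to reading them from a redirected file or a pipe.
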
biochem-fan commented 4 years ago

@tbepler Thanks for your comment. Sorry, I didn't notice your response.

  1. Respecting directory structure.

In topaz extract, the output txt file contains only the file name without extension (e.g. 001), even when I run the program with topaz extract DatasetA/*.mrc. I want the output to be DatasetA/001.mrc etc to distinguish it from DatasetB/001.mrc. Another situation is when processing images from EPU. EPU generates a directory per grid square (e.g. GridSquare_XXXX/Data/FoilHole_YYYY.mrc). When we split the txt file into individual STAR files, we need the path, otherwise we don't know where to write the file.
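
The collision described above can be reproduced with a small Python sketch. It assumes only the naming rule stated in this comment (directory dropped, extension removed); the function names `image_name` and `find_name_collisions` are hypothetical and are not Topaz functions.

```python
import os
from collections import defaultdict


def image_name(path):
    """Image name as described above: directory dropped, extension stripped."""
    return os.path.splitext(os.path.basename(path))[0]


def find_name_collisions(paths):
    """Group paths by image name and keep only names shared by more than one path."""
    by_name = defaultdict(list)
    for path in paths:
        by_name[image_name(path)].append(path)
    return {name: group for name, group in by_name.items() if len(group) > 1}
```

For the inputs DatasetA/001.mrc and DatasetB/001.mrc, both paths map to the name 001, so a single output table cannot tell them apart.
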

tbepler commented 4 years ago

Yes, topaz extract drops the directory and strips the file extension when creating the image name, for consistency with topaz train and to make associating the particles with different versions of the micrographs easier. For example, let's say you have DatasetA/raw/, DatasetA/denoised/, DatasetA/corrected/, ... each containing micrographs named in the same way. Then the particle file output by topaz extract maps to each of these easily. It would be straightforward to add an option to topaz extract to not trim the micrograph paths.

The problem could also be addressed by adding an option to topaz extract to write the particle coordinates out as individual files to the same locations as the inputs. This may be a more elegant solution, because it would also work for topaz denoise and other commands that write micrographs. For example, if inputs are topaz extract DatasetA/001.mrc DatasetA/002.mrc DatasetB/001.mrc DatasetB/002.mrc ..., then the outputs would be DatasetA/001_particles.txt DatasetA/002_particles.txt DatasetB/001_particles.txt DatasetB/002_particles.txt or something like that. The "_particles.txt" part could be a user defined suffix.
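
A minimal Python sketch of the suffix idea, assuming the output simply replaces the micrograph's extension with a user-defined suffix. The function name `particle_file_for` and its default suffix are illustrative only and do not describe an existing Topaz option.

```python
import os


def particle_file_for(micrograph_path, suffix="_particles.txt"):
    """Place the coordinate file next to its micrograph, swapping the extension for a suffix."""
    stem, _extension = os.path.splitext(micrograph_path)
    return stem + suffix
```

Because the directory part of the input is kept, DatasetA/001.mrc and DatasetB/001.mrc get distinct output files without any extra bookkeeping.
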

huynhk03 commented 4 years ago

Hello, I was wondering how the RELION integration is going? I was giving Topaz a try a few days ago and could not figure out how to get the coordinates integrated into the RELION 3.1.0 workflow. Is there currently a workaround for generating coordinate star files from the particles picked by Topaz?

Thank you in advance, Kevin

alexjnoble commented 4 years ago

Hi Kevin,

Sorry for the delay. There are scripts available here for use as Relion 3.1 plugins:

https://github.com/tbepler/topaz/tree/master/relion_run_topaz

The denoising scripts are complete. The picking scripts are still under development, so consider them as beta releases. I hope to find time in the next week to finish those.

Best, -Alex

PiotrDra commented 4 years ago

run_topaz_pick.py executed from RELION 3.1.0

/home/peter/.conda/envs/topaz/run_topaz_pick.py --o External/job028/ --in_mics Select/job015/micrographs.star --number_of_particles 400 --scale_factor 6 --trained_model $

stops with error:

    File "/home/peter/.conda/envs/topaz/run_topaz_pick.py", line 131
    g.to_csv(f'{outpath}{k}_topazpick.star' , sep='\t', index=False, columns=['x_coord','y_coord','score'], header=None)
                                          ^
SyntaxError: invalid syntax

Any suggestions on what could be the reason and how to make the script functional?

tbepler commented 4 years ago

@PiotrDra This sounds like a python version problem. The f'...' syntax requires python 3.6 or newer. Can you check that your topaz install is using python 3?
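
Because an f-string is rejected when the file is parsed, an older interpreter reports the SyntaxError before any line of the script runs. As a hedged illustration, the same path can be built with `str.format`, which also parses on older interpreters. The values of `outpath` and `k` below are hypothetical stand-ins for the variables on the failing line.

```python
import sys

# Hypothetical values standing in for the variables on the failing line.
outpath = "External/job028/"
k = "micrograph_001"

# Same result as the f-string on line 131, written so that older Pythons can parse it.
star_path = "{}{}_topazpick.star".format(outpath, k)

# Shows which interpreter is actually running the script.
print(sys.version_info[:3], star_path)
```
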

LizelleLL commented 4 years ago

Hi, my particle is small and consists of two homologous globular domains, which has made picking very difficult so far. I am really interested to see how Topaz performs, especially with the top views, which are just a dot on the micrograph.

I have tried to use Topaz but cannot extract particles in Relion using the coordinates from Topaz picking. I used the Relion integration of Topaz to denoise and train on 7689 micrographs, using the particles.star file from my best 3D map as positive labels. A Relion particle extraction job using the coords_suffix_topazpicks.star file (written by run_topaz_pick.py) and the micrographs.star file as input failed to extract any particles, although many coordinates were written by Topaz picking (stderr: Warning: coordinate file External/job568/__/raw/GridSquare_7115372/Data/FoilHole_8065276_Data_7120079_7120081_20191008_0757_fractions_topazpicks.star does not exist...). Job 568 was the run_topaz_pick.py job.

I think that the issue is point 3 made by @biochem-fan : "Respect the directory structure For example, a user might have Dataset1/001.mrc and Dataset2/001.mrc. Currently Topaz only looks at the file name, so these two get mixed up." My dataset was collected using EPU and as mentioned above "EPU generates a directory per grid square (e.g. GridSquare_XXXX/Data/FoilHole_YYYY.mrc)".
My directory structure in previous Relion AutoPick jobs was:
AutoPick/jobXXX/__/raw/GridSquare_XXX/Data/FoilHole_XXX_fractions_autopick.star

My micrographs.star file that I use for run_topaz_pick.py contains the following for each micrograph in the _rlnMicrographName and _rlnCtfImage columns (there is a double underscore directory one up from raw - github doesn't want to write that):
MotionCorr/jobXXX/__/raw/GridSquare_xxx/Data/FoilHole_xxx_fractions.mrc
CtfFind/jobXXX/__/raw/GridSquare_xxx/Data/FoilHole_xxx_fractions.ctf:mrc

As you can see, the per-grid-square directory structure is carried through and since it is not maintained by Topaz, I cannot use the generated coordinates for further processing in Relion.

Can you please suggest a work-around for this? I don't have any python knowledge and have no idea how to fix this issue. Regards Lizelle

tbepler commented 4 years ago

As a workaround, running extract on each of the directories individually should solve this problem.

tbepler commented 4 years ago

Extract can also be run once per micrograph, e.g.

for micrograph in $set_of_micrographs; do
    topaz extract "$micrograph" ...
done

This also allows writing one output file per micrograph.
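
The same per-micrograph loop could be driven from Python. This is a sketch only: `build_extract_commands` and `run_all` are hypothetical helpers, and `extra_args` stands in for the options elided as "..." above, so no specific Topaz flags are assumed.

```python
import subprocess


def build_extract_commands(micrograph_paths, extra_args=()):
    """Build one `topaz extract` command line per micrograph.

    extra_args is passed through unchanged; it represents whatever options the user needs.
    """
    return [["topaz", "extract", path, *extra_args] for path in micrograph_paths]


def run_all(commands):
    """Run each command in turn and stop at the first one that fails."""
    for command in commands:
        subprocess.run(command, check=True)
```

Building the command lists separately from running them makes it easy to print and inspect them before anything is executed.
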

tbepler commented 4 years ago

commit 752c140a709c745dabdcc2232b6e9444a11e1ef1 adds support for writing extracted coordinates as one file per micrograph and also adds support for piping the micrograph paths to topaz.

biochem-fan commented 4 years ago

This is a dirty patch but solves the issue of working with images scattered in many sub-directories. When I have time, I will refactor this using my STAR file parser.

diff --git a/relion_run_topaz/run_topaz_pick.py b/relion_run_topaz/run_topaz_pick.py
index 198133e..e8f2d64 100644
--- a/relion_run_topaz/run_topaz_pick.py
+++ b/relion_run_topaz/run_topaz_pick.py
@@ -4,16 +4,22 @@
 # This is to run Topaz picker (https://github.com/tbepler/topaz) from Relion as an External job type 
 # Rafael Fernandez-Leiro 2020 - CNIO - rfleiro@cnio.es
 # Alex J. Noble 2020 - NYSBC - anoble@nysbc.org
+# @biochem_fan 2020

 # Run with Relion external job
 # Provide executable in the gui: run_topaz_pick.py
 # Input micrographs.star
 # Provide extra parameters in the parameters tab (scalefactor, trained_model, pick_threshold, select_threshold, skip_pi

+# TODO
+#  Earlier error check
+#  Number of workers
+#  Continue

 """Import >>>"""
 import argparse
 import os
+import re
 """<<< Import"""

 """USAGE >>>"""
@@ -93,13 +99,31 @@ os.system(cmd)
 """make star files >>>"""
 #make star files in the right folder
 print('Making star files...')
-os.system(str('''relion_star_printtable ''')+inargsMics+str(''' data_micrographs _rlnMicrographName | awk -F"/" 'NR==1{
-tmpdf=open(tmpfile).readline().rstrip('\n')
-outopaz_path=outargsPath+tmpdf+'/'
-os.system(str('mkdir ')+outopaz_path+str(';rm ')+tmpfile)
+os.system('relion_star_printtable %s data_micrographs _rlnMicrographName > %s' % (inargsMics, tmpfile))
+
+basename_to_dir = {}
+for line in open(tmpfile):
+       original_filename = line.rstrip()
+       dirname = os.path.dirname(original_filename)
+       filename = os.path.basename(original_filename)
+       filename_without_ext = filename[:filename.rfind('.')]
+       # strip job path
+       m = re.match("[^/]+/job\d+\/", dirname)
+       if m:
+               dirname = dirname[m.end():]
+
+       if filename_without_ext in basename_to_dir:
+               sys.stderr.write("ERROR: Sorry, you cannot have two files with the same name, even if they are in different directories")
+               sys.exit(-1)
+       basename_to_dir[filename_without_ext] = dirname
+
+os.remove(tmpfile)
+
 mic_filenames=list(set([x.split('\t')[0] for x in open(outargsResults2).readlines()[1:]]))
 topaz_picks=[x.split('\t') for x in open(outargsResults2).readlines()[1:]]
 for name in mic_filenames:
+       outopaz_path=outargsPath+basename_to_dir[name]+'/'
+       os.makedirs(outopaz_path, exist_ok=True)
        star_file=outopaz_path+name+'_topazpicks.star'
        with open(star_file, 'w') as f:
LizelleLL commented 4 years ago

Hi @biochem-fan and @tbepler . Thanks for the advice, I appreciate it!

We've had some PC issues and I haven't tried the new version yet but the patch looks like a good idea. Unfortunately, I've never used one before and don't quite understand how to use it. Should I modify the run_topaz_pick.py script in my Relion directory to match the one above? Should any lines be removed from the original script?

I'm also not too clear on the usage in Relion. I've denoised the selection of micrographs for model training and used the resulting denoised micrographs.star with trained model as input for picking but this denoised micrographs.star file does not contain the directory names anymore, just the file names. Should I use the denoised or the original selection of micrographs (where the directory names are still present) for picking with this patch applied? I don't see how the directory names would be known if I used the denoised micrographs.star file. However, when I used the micrographs.star file before denoising as an input for topaz picking I obtained zero picks.

Can I apply a similar patch to the denoising script and proceed with picking from denoised micrographs.star before running Extraction in Relion using the topaz_picks_scaled.star file (containing the correct directories as part of the micrograph names) and the original (not denoised) micrographs.star file as input?

biochem-fan commented 4 years ago

@LizelleLL

"Unfortunately, I've never used one before and don't quite understand how to use it. Should I modify the run_topaz_pick.py script in my Relion directory to match the one above?"

Yes.

"Should any lines be removed from the original script?"

+ means add the line, - means remove the line.
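
To see how these markers arise, here is a small sketch that uses Python's standard `difflib` module to generate a unified diff between two versions of a file. The file contents are made up for illustration and are not the real script.

```python
import difflib

# Made-up "before" and "after" versions of a short file.
original = ["import os\n", "os.system('mkdir out')\n"]
patched = ["import os\n", "import re\n", "os.makedirs('out', exist_ok=True)\n"]

diff = list(difflib.unified_diff(
    original, patched,
    fromfile="a/run_topaz_pick.py",
    tofile="b/run_topaz_pick.py",
))

# Lines starting with "-" exist only in the old version, lines starting with "+"
# only in the new one, and lines starting with a space are unchanged context.
print("".join(diff), end="")
```
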

That being said, if you are not familiar with these things, I recommend that you wait until my patch is tested and incorporated into the official distribution.

Regarding denoising:

Because I myself don't use denoising, it is of lower priority for me. The idea is the same. I hope the original developers work on it.

LizelleLL commented 4 years ago

Thanks for the reply, @biochem-fan. We have an excellent IT person in our unit who should be able to help me use this patch.

After Takanori mentioned that he doesn't use denoising, I was wondering, @tbepler: is the picking algorithm affected by denoising, or is denoising simply useful for a person to manually evaluate the model training and picking? If picking is not affected by denoising, can I use denoised micrographs to choose the best parameters for model training on my current 3D model's particles and also the best parameters for picking, then go back to the original noisy micrographs and use these optimized parameters for training and picking (without manual evaluation, since I won't be able to see my particle)? That way I can use Takanori's patch for picking and proceed with processing in Relion.

Please let me know if you think this may work.

Thanks, Lizelle

alexjnoble commented 4 years ago

Hi Lizelle,

In our limited tests of training Topaz picking models on raw versus denoised micrographs, we do not see an improvement using denoised micrographs over raw micrographs. So you should use whichever is most convenient for your workflow.

Be aware, however, that we strongly advise that you do not use denoised particles for particle alignment. Please refer to the paragraph on the hallucination problem in the Discussion section of the Topaz-Denoise paper: https://www.nature.com/articles/s41467-020-18952-1

Best, -Alex

LizelleLL commented 4 years ago

Hi All, Our IT guy helped with the patch for topaz picking. He mentioned that there were some line-length problems in the patch from @biochem-fan and I am pasting his cleaner patch here below. The directory structure was maintained and I could use the coordinates for Extraction in Relion. Thanks for all the help!

--- run_topaz_pick.py   2020-10-07 14:48:00.370394000 +0200
+++ /tmp/run_topaz_pick.py  2020-10-21 14:52:34.379936000 +0200
@@ -4,16 +4,22 @@
 # This is to run Topaz picker (https://github.com/tbepler/topaz) from Relion as an External job type 
 # Rafael Fernandez-Leiro 2020 - CNIO - rfleiro@cnio.es
 # Alex J. Noble 2020 - NYSBC - anoble@nysbc.org
+# @biochem_fan 2020

 # Run with Relion external job
 # Provide executable in the gui: run_topaz_pick.py
 # Input micrographs.star
 # Provide extra parameters in the parameters tab (scalefactor, trained_model, pick_threshold, select_threshold, skip_pick)

+# TODO
+#  Earlier error check
+#  Number of workers
+#  Continue

 """Import >>>"""
 import argparse
 import os
+import re
 """<<< Import"""

 """USAGE >>>"""
@@ -93,13 +99,31 @@
 """make star files >>>"""
 #make star files in the right folder
 print('Making star files...')
-os.system(str('''relion_star_printtable ''')+inargsMics+str(''' data_micrographs _rlnMicrographName | awk -F"/" 'NR==1{print $(NF-1)}' > ''')+tmpfile)
-tmpdf=open(tmpfile).readline().rstrip('\n')
-outopaz_path=outargsPath+tmpdf+'/'
-os.system(str('mkdir ')+outopaz_path+str(';rm ')+tmpfile)
+os.system('relion_star_printtable %s data_micrographs _rlnMicrographName > %s' % (inargsMics, tmpfile))
+
+basename_to_dir = {}
+for line in open(tmpfile):
+       original_filename = line.rstrip()
+       dirname = os.path.dirname(original_filename)
+       filename = os.path.basename(original_filename)
+       filename_without_ext = filename[:filename.rfind('.')]
+       # strip job path
+       m = re.match("[^/]+/job\d+\/", dirname)
+       if m:
+               dirname = dirname[m.end():]
+
+       if filename_without_ext in basename_to_dir:
+               sys.stderr.write("ERROR: Sorry, you cannot have two files with the same name, even if they are in different directories")
+               sys.exit(-1)
+       basename_to_dir[filename_without_ext] = dirname
+
+os.remove(tmpfile)
+
 mic_filenames=list(set([x.split('\t')[0] for x in open(outargsResults2).readlines()[1:]]))
 topaz_picks=[x.split('\t') for x in open(outargsResults2).readlines()[1:]]
 for name in mic_filenames:
+        outopaz_path=outargsPath+basename_to_dir[name]+'/'
+        os.makedirs(outopaz_path, exist_ok=True)
    star_file=outopaz_path+name+'_topazpicks.star'
    with open(star_file, 'w') as f:
        f.write('# version 30001\n\ndata_\n\nloop_\n_rlnCoordinateX #1\n_rlnCoordinateY #2\n_rlnAutopickFigureOfMerit #3\n')
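
The path handling in the patch can be summarised as a standalone sketch. It reuses the patch's regular expression as a raw string, but takes the file stem with `os.path.splitext` instead of the patch's `rfind` slicing. The function name `coordinate_path` and the job numbers in the usage note are hypothetical.

```python
import os
import re

# Same pattern as in the patch: a leading "<JobType>/job<digits>/" prefix.
JOB_PREFIX = re.compile(r"[^/]+/job\d+/")


def coordinate_path(micrograph_name, out_dir, suffix="_topazpicks.star"):
    """Map a _rlnMicrographName entry to its coordinate file path under out_dir."""
    dirname = os.path.dirname(micrograph_name)
    stem = os.path.splitext(os.path.basename(micrograph_name))[0]
    # Strip the job path so that only the dataset's own sub-directories remain.
    match = JOB_PREFIX.match(dirname)
    if match:
        dirname = dirname[match.end():]
    return os.path.join(out_dir, dirname, stem + suffix)
```

For example, MotionCorr/job002/__/raw/GridSquare_1/Data/FoilHole_1_fractions.mrc with the output directory External/job568/ maps to External/job568/__/raw/GridSquare_1/Data/FoilHole_1_fractions_topazpicks.star, which has the same layout as the path in the Relion warning quoted earlier in this thread.
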
biochem-fan commented 4 years ago

In the next major update of RELION (3.2, not 3.1.x; hopefully early next year), the Topaz wrapper will be integrated into the AutoPick job instead of being run as an External job type. It is currently being tested in house. With that, the problems associated with directories and "Continue" should be solved.

Meanwhile, please use the above patch. @LizelleLL, thanks for feedback and testing.