sandberg-lab / Spreading-Correction

Supplementary information to "Computational correction of index switching in multiplexed sequencing libraries" (Larsson et. al 2018).
MIT License
15 stars 4 forks source link

Tipps for applying "Spreading-Correction" on Single Cell Genome Assemblies? #2

Closed jvollme closed 4 years ago

jvollme commented 6 years ago

Hello, I want to try to apply this tool in order to rescue our bacterial single cell data that showed extremely high levels of cross-contamination when multiplexed on the HiSeq Xten. Since our librarys are based on Multiple Displacement Amplification (MDA) products, they have highly uneven coverage, similar to transcriptome data, and simple coverage cutoffs are not enough to identify cross-contaminants. Also it is not enough to simply identify contigs occuring in multiple samples and just attributing them to the one library where they have the highest coverage, since I may have a few single cells originating from the same species. So your tool seems to be a blessing here.

My plan to apply your workflow for contamination here is as follows: 1.) assembled the data of each library seperately 2.) cluster the contigs of all assemblies using strict identity cutoffs, to obtain representative, mappable consensus-contigs for each (potential) contaminant 3.) Map reads onto the clustered assemblies in order to obtain coverage values for each contig-cluster in each library and create input files similar to your transcriptome input. 4.)use your script to correct the coverage data with respect to the cross-contamination and then filter contaminating contigs from my datasets based on coverage cutoffs.

However, since the contigs are differently sized, (and moreover are unlikely to be fully complete in all cross-contaminated samples), I do not want to use simple read counts per contig, but rather average coverage values (e.g. mean read coverage per base position). This would mostly result in decimal values.

Since your example input seems to consist exclusively of integer values: Does your script also accept decimal/float coverage values in the input table?Or do I need to round such data? Or would you recommend a completely different approach here?

AntonJMLarsson commented 6 years ago

When I wrote the script I had single cell transcriptomics in mind, but I think this script should work for genome assemblies as well.

To answer your questions:

The approach does in principle accept decimal values, except I do explicitly round values in the script (line 185 np.rint() ) which would have to be removed. You will probably have to use --h to specify which value 'high coverage' should be (in read count terms it is set to 30 by default). You might be able to establish the true contig coverage to spread contig coverage using this. The diagnostic plots should be able to guide you to some extent.

You should also set the cutoff to an appropriate value, try setting it to 0 to begin with.

Let me know how it goes!

Anton

jvollme commented 6 years ago

Thanks, for the info. Just to clarify though: From the arguments description (The number of reads to use to be considered highly expressed in only one cell), I guess "--h" should be the minimum coverage value that can not be achieved by cross-contamination? Or is it just important that it is definitively above the maximum coverage that can be achieved by cross-contamination?

AntonJMLarsson commented 6 years ago

--h will specify the coverage value to use to look for cells where the coverage value is above h in only one cell (and nonzero in other cells). I think the former option is fine, as in the minimum coverage value that can not be achieved by cross-contamination from one source.

jvollme commented 6 years ago

OK, thanks, it may take some time though until I can post how it went, because the clustering seems to take a lot longer than anticipated.

jvollme commented 6 years ago

Hi, when running the tool i get the following error message:

  from pandas.core import datetools
Reading file: rounded_transposed_for_correction.csv
Traceback (most recent call last):
  File "/home/ww5070/tools/Spreading-Correction/unspread.py", line 69, in <module>
    df_noindex = df.drop([i5_index_name,i7_index_name], axis=1).astype(np.int)
  File "/home/ww5070/tools/miniconda2/envs/py36_env/lib/python3.6/site-packages/pandas/util/_decorators.py", line 118, in wrapper
    return func(*args, **kwargs)
  File "/home/ww5070/tools/miniconda2/envs/py36_env/lib/python3.6/site-packages/pandas/core/generic.py", line 4004, in astype
    **kwargs)
  File "/home/ww5070/tools/miniconda2/envs/py36_env/lib/python3.6/site-packages/pandas/core/internals.py", line 3462, in astype
    return self.apply('astype', dtype=dtype, **kwargs)
  File "/home/kit/ibg/ww5070/tools/miniconda2/envs/py36_env/lib/python3.6/site-packages/pandas/core/internals.py", line 3329, in apply
    applied = getattr(b, f)(**kwargs)
  File "/home/ww5070/tools/miniconda2/envs/py36_env/lib/python3.6/site-packages/pandas/core/internals.py", line 544, in astype
    **kwargs)
  File "/home/ww5070/tools/miniconda2/envs/py36_env/lib/python3.6/site-packages/pandas/core/internals.py", line 625, in _astype
    values = astype_nansafe(values.ravel(), dtype, copy=True)
  File "/home/ww5070/tools/miniconda2/envs/py36_env/lib/python3.6/site-packages/pandas/core/dtypes/cast.py", line 687, in astype_nansafe
    raise ValueError('Cannot convert non-finite values (NA or inf) to '
ValueError: Cannot convert non-finite values (NA or inf) to integer

my command-line invocation is: ${HOME}/tools/Spreading-Correction/unspread.py rounded_transposed_for_correction.csv --i5 "i5.index.name" --i7 "i7.index.name" --h 80 --c 2 --column 0

my input file is attached here: rounded_transposed_for_correction.zip

Can you tell what I am doing wrong?

AntonJMLarsson commented 6 years ago

You seem to have invalid values in your data (ValueError). I've added a line which turns these values to 0. You should avoid this problem now. However in addition you seem to need to specify the number of rows and columns in your plate (default is a 384 well plate and you have 69 rows in your dataframe so you should consider the layout that you used).

jvollme commented 6 years ago

Ah ok, the problem was, of course, the input file. The problem was that i pasted the coverage and index infos together from two seperate tables in text format. Turns out one of them was edited in windows and had inappropriate line endings, which were not shown in unix editors.

With this in mind, I think it is probably best to revert that change (setting invalid values to "0") again. I think it is definitively better to have the script abort with an error message when finding an "invalid value" and have the user find out why, than to risk having wrong correction results because the fault in the input was never detected. Instead perhaps it could be possible to include the linenumbers of input lines containing invalid values in the error message?

Concerning the layout, yes, I meant to put in the appropriate --row and --col arguments but forgot and then ran into the other problem first. So now, with the input corrected and the appropriate arguments given, the script now ran successfully.

However, the results where not as clear as I had hoped. Just judging from my coverage table, I can see rather obvious contaminations (e.g. coverage <1000x in one sample, but only >70x in other samples with different expected taxonomic affiliation but sharing one of the index primers). The script however reports that "acceptable spread" has been detected and that there is "nothing to do". Reducing the "acceptable-spread" to <= 0.01 then prompts correction, but the adjustments often seem minimal (reducing the corrected coverage of a highly likely contaminant from 18x to 11x or from 42x to 32x). However in my case i need a "yes" or "no" classification of potential contaminants, as they have to be absolutely removed from each dataset (reduced observed coverage of contaminating contigs will hardly affect the genome analyses). Playing around with the "--h" and "--t" parameters does not seem to change the results.

My guess ist that my current workflow (contig-clustering and use of average-per-base-coverage as input values) or perhaps generally the correction of (de Novo) single cell genome assembly data is not quite suitable for use with this script (yet)?

AntonJMLarsson commented 6 years ago

If you think the rate of spreading is a particular value you can try to manually set the variable on line 125 (or 176-179) if the estimation method in the script does not work to your satisfaction. I will think about adding an additional parameter so one can do this from the command line.