shah-rohan / SmartMap

Other
4 stars 2 forks source link

prepare BED files of alignments with STAR #2

Open friedue opened 3 years ago

friedue commented 3 years ago

Hi,

I'm considering to use SmartMap, but would prefer to use STAR instead of bowtie for the alignment step. Can you delineate which types of scores you expect and (how you interpret them) for the AS:i tag? And do I understand correctly that you generate separate folders for reads that align to 2,3,4 and more regions, i.e. if we have reads with 5000 hits, that would mean 5000 directories?

EDIT: Basically, I just want to get a better sense of how to re-create a BED file of alignments that will work with SmartMap.

Thank you!

Friederike

friedue commented 3 years ago

Following up on this, I've created, what I think, are BED files matching the description in your README file here (with the small caveat that your README file does not mention the "strand" field, which is part of the output by SmartMapPrep.

Just for documentation purposes, here's what I'm currently running:

$ SmartMap -m 100 -g chrLengths.txt -o myTest splits/

## here's what the one of the BED files looks like
$ cat splits/test_map-10.bed 
chr1    48498420        48498419        SRR10195006.189 +       AS:i:73 YS:i:73
chr10   3625016 3625015 SRR10195006.189 +       AS:i:73 YS:i:73
chr10   106316624       106316623       SRR10195006.189 +       AS:i:73 YS:i:73
chr14   92387163        92387162        SRR10195006.189 +       AS:i:73 YS:i:73
chr18   90277345        90277344        SRR10195006.189 +       AS:i:73 YS:i:73
chr6    109070828       109070827       SRR10195006.189 +       AS:i:73 YS:i:73
chr17   54912751        54912750        SRR10195006.189 +       AS:i:73 YS:i:73
chr7    59065216        59065215        SRR10195006.189 +       AS:i:73 YS:i:73
chr5    63603854        63603853        SRR10195006.189 +       AS:i:73 YS:i:73
chr6    56508593        56508592        SRR10195006.189 +       AS:i:73 YS:i:73
chr16   6342380 6342379 SRR10195006.626 +       AS:i:73 YS:i:73
chr4    55418228        55418227        SRR10195006.626 +       AS:i:73 YS:i:73
chrY    50040361        50040360        SRR10195006.626 +       AS:i:73 YS:i:73
chr4    55422936        55422935        SRR10195006.626 +       AS:i:73 YS:i:73
chrY    77490698        77490697        SRR10195006.626 +       AS:i:73 YS:i:73
chrY    70902504        70902503        SRR10195006.626 +       AS:i:73 YS:i:73
chrY    62660096        62660095        SRR10195006.626 +       AS:i:73 YS:i:73
chrY    70897877        70897876        SRR10195006.626 +       AS:i:73 YS:i:73
chr14   91647339        91647338        SRR10195006.626 +       AS:i:73 YS:i:73
chrY    77486070        77486069        SRR10195006.626 +       AS:i:73 YS:i:73
chr1    102547294       102547293       SRR10195006.745 +       AS:i:74 YS:i:74
chr4    26461528        26461527        SRR10195006.745 +       AS:i:74 YS:i:74
chr4    59966076        59966075        SRR10195006.745 +       AS:i:74 YS:i:74
chrX    46991250        46991249        SRR10195006.745 +       AS:i:74 YS:i:74
chr10   6550281 6550280 SRR10195006.745 +       AS:i:74 YS:i:74
chr3    135195591       135195590       SRR10195006.745 +       AS:i:74 YS:i:74
chr13   106225578       106225577       SRR10195006.745 +       AS:i:74 YS:i:74
chr14   38036760        38036759        SRR10195006.745 +       AS:i:74 YS:i:74
chr1    45106498        45106497        SRR10195006.745 +       AS:i:74 YS:i:74
chr4    70585296        70585295        SRR10195006.745 +       AS:i:74 YS:i:74
chr4    13055720        13055719        SRR10195006.562 -       AS:i:73 YS:i:73
chr5    60773017        60773016        SRR10195006.562 -       AS:i:73 YS:i:73
chrX    114697080       114697079       SRR10195006.562 -       AS:i:73 YS:i:73
chr9    81464996        81464995        SRR10195006.562 -       AS:i:73 YS:i:73
chr2    16961773        16961772        SRR10195006.562 -       AS:i:73 YS:i:73
chr17   16023600        16023599        SRR10195006.562 -       AS:i:73 YS:i:73
chr14   60338890        60338889        SRR10195006.562 -       AS:i:73 YS:i:73
chrX    143157135       143157134       SRR10195006.562 -       AS:i:73 YS:i:73
chr14   36214192        36214191        SRR10195006.562 -       AS:i:73 YS:i:73
chr1    80085965        80085964        SRR10195006.562 -       AS:i:73 YS:i:73
shah-rohan commented 3 years ago

Hi,

The scores here are the internal Bowtie2 scores used to measure the quality of an alignment; a higher (less negative) score is indicative of a better alignment. The AS:i: score measures the quality of the alignment of the read and the YS:i: score measures the quality of the mate.

There are not separate directories for each number of alignments per read -- just one output file!

In your file here, you have the strand in the fifth column. If you want to do a strand-specific analysis (like RNA-seq) where you analyze both strands separately and independently, you need to include the -S flag in your SmartMap command, something like: SmartMap -S -m 100 -g chrLengths.txt -o myTest splits/*.bed

If, on the other hand, you don't want a strand-specific analysis (e.g. ChIP-seq), get rid of the fifth field from the BED files and use the command SmartMap -m 100 -g chrLengths.txt -o myTest [whatever your bed directory is]/*.bed

Hope that helps!

friedue commented 3 years ago

Ah, right, I was looking at the SmartMapRNAPrep script, which is where the strand was coming from.

Is the alignment score taken into account for the weighting down the road?

shah-rohan commented 3 years ago

If you are using scored mode (where you set the minimum score with the -s flag (recommended at -0.6*(total read length + 1)), then this alignment score is taken into account for weighting. If you do not enable the -s option, then it is not taken into account. We recommend using scored mode -- we find that it has lower error than does unscored mode.

friedue commented 3 years ago

and how would the score mode be affected by the absence of mates for single-end data?

shah-rohan commented 3 years ago

I haven’t tested it with single-end data so I don’t know how it will influence the outcome – a priori, I can’t imagine it would be a positive impact relative to paired-end data, but I can’t say for certain.

In the absence of a mate, there won't be a YS column provided by the aligner. In this case, the YS column is required by SmartMap insofar as it has to exist, but if you just append a field YS:i:0 to each line of the BED file then it should be alright. The minimum score parameter (flag -s) will then be -0.6*(read length + 1) where read length is the SE read length.

friedue commented 3 years ago

Let me rephrase: why is the YS:i field needed? How is it taken account for the scoring (or any other part of the algorithm)?